Algorithms for the Weighted Orthogonal Procrustes Problem and ...
UMINF-06.10 ISSN-0348-0542 ISBN 91-7264-052-9


Preface

This thesis consists of the following six papers.

I. P. Å. Wedin and T. Viklands. Algorithms for 3-dimensional Weighted Orthogonal Procrustes Problems. Technical Report UMINF-06.06, Department of Computing Science, Umeå University, Umeå, Sweden, 2006.

II. T. Viklands and P. Å. Wedin. Algorithms for Linear Least Squares Problems on the Stiefel Manifold. Technical Report UMINF-06.07, Department of Computing Science, Umeå University, Umeå, Sweden, 2006.

III. T. Viklands. On the Number of Minima to Weighted Orthogonal Procrustes Problems. Technical Report UMINF-06.08, Department of Computing Science, Umeå University, Umeå, Sweden, 2006. Submitted for publication in BIT.

IV. T. Viklands. On Global Minimization of Weighted Orthogonal Procrustes Problems. Technical Report UMINF-06.09, Department of Computing Science, Umeå University, Umeå, Sweden, 2006. Submitted for publication in BIT.

V. T. Viklands. A Cubic Convergent Iteration Method. Technical Report UMINF-05.10, Department of Computing Science, Umeå University, Umeå, Sweden, 2005.

VI. T. Viklands and M. Gulliksson. Optimization Tools for Solving Nonlinear Ill-posed Problems. Fast solution of discretized optimization problems (Berlin, 2000), 255-264, Internat. Ser. Numer. Math., 138, Birkhäuser, Basel, 2001.

In Chapter 1, an introduction to the optimization problems considered is presented, along with an overview of all papers. The papers are referred to by their Roman numerals.






Chapter 1
Introduction and overview

The main part of this thesis is about an optimization problem known as the weighted orthogonal Procrustes problem (WOPP), which we define as:

Definition 1.0.1 With Q ∈ R^{m×n} where n ≤ m, let A, X and B be known real matrices of compatible dimensions with rank(A) = m and rank(X) = n. Let ||·||_F denote the Frobenius matrix norm. The optimization problem

    min_Q ||AQX − B||_F^2 , subject to Q^T Q = I_n,                    (1.0.1)

is called a weighted orthogonal Procrustes problem.

The Frobenius matrix norm can be regarded as the Euclidean norm for matrices. For vectors, the Euclidean norm is commonly known as the 2-norm, ||·||_2. For a vector y ∈ R^k, its Euclidean length is

    ||y||_2 = ( sum_{i=1}^{k} y_i^2 )^{1/2}.

For a matrix Y ∈ R^{m×n}, the Frobenius norm ||Y||_F is

    ||Y||_F = ( sum_{i=1}^{m} sum_{j=1}^{n} y_{i,j}^2 )^{1/2}.

The WOPP is a linear least squares problem defined on a Stiefel manifold. A Stiefel manifold [30], commonly denoted V_{m,n}, is the set of all matrices Q ∈ R^{m×n} having orthonormal columns. V_{m,n} is also referred to as the Stiefel manifold of orthogonal n-frames in R^m,

    V_{m,n} = {Q ∈ R^{m×n} : Q^T Q = I_n}.

A set of nonzero vectors {q_1, ..., q_n} in R^m is said to be orthogonal if q_i^T q_j = 0 when i ≠ j. If additionally q_i^T q_i = 1 (normalized), the set is said to be orthonormal [12].
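As a point of reference, the objective and the feasibility constraint of Definition 1.0.1 are short to state in code. The following NumPy sketch is our own illustration (the helper names are not from the thesis software); it evaluates the WOPP objective and tests membership of V_{m,n}:

```python
import numpy as np

def wopp_objective(Q, A, X, B):
    """The WOPP objective ||AQX - B||_F^2 of Definition 1.0.1."""
    return np.linalg.norm(A @ Q @ X - B, ord='fro') ** 2

def on_stiefel(Q, tol=1e-10):
    """Check the constraint Q^T Q = I_n, i.e., Q in V_{m,n}."""
    n = Q.shape[1]
    return np.linalg.norm(Q.T @ Q - np.eye(n)) <= tol

# A random feasible point: the Q factor of a thin QR factorization.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 3)))
print(on_stiefel(Q))   # True
```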


The definition of an orthogonal matrix is well known. A square matrix Q ∈ R^{m×m} is said to be orthogonal if Q^T Q = I_m; hence QQ^T = I_m and Q^T = Q^{-1}. This may seem a bit ambiguous, since the columns of Q form not just an orthogonal basis but also an orthonormal basis. When speaking of an orthonormal matrix Q = [q_1, ..., q_n], we mean that the columns of Q form an orthonormal basis, i.e., Q^T Q = I_n. If n = m, then Q is orthogonal (a square orthonormal matrix).

Throughout this thesis, we assume that n ≤ m. Evidently Q^T Q ≠ I_n if n > m.

The WOPP can be regarded as a generalization of the orthogonal Procrustes problem (OPP).

Definition 1.0.2 With Q ∈ R^{m×n} where n ≤ m, let X with rank(X) = n and B be known real matrices of compatible dimensions. We call the optimization problem

    min_Q ||QX − B||_F^2 , subject to Q^T Q = I_n,                     (1.0.2)

an orthogonal Procrustes problem.

The OPP has an analytical solution that can be derived by using the singular value decomposition (SVD) of XB^T. The WOPP, on the other hand, typically needs to be solved by iterative optimization algorithms.

To derive the solution to (1.0.2), use the property that for a matrix Y ∈ R^{m×n}, ||Y||_F^2 = tr(Y^T Y), where tr(·) is the matrix trace,

    tr(Ỹ) = sum_{i=1}^{n} ỹ_{i,i} ,   Ỹ = Y^T Y.

We can then write

    ||QX − B||_F^2 = tr((QX − B)^T (QX − B))
                   = tr(X^T Q^T QX) − 2 tr(QXB^T) + tr(B^T B)
                   = ||X||_F^2 − 2 tr(QXB^T) + ||B||_F^2,

since Q^T Q = I_n. Solving (1.0.2) is then done by maximizing tr(QXB^T). To do so, let UΣV^T = XB^T be an SVD; then

    tr(QXB^T) = tr(QUΣV^T) = tr(V^T QUΣ) = tr(ΣZ) = sum_{i=1}^{n} σ_{i,i} z_{i,i} ,    (1.0.3)

where Z = V^T QU. Since Z ∈ R^{m×n} is orthonormal, |z_{i,j}| ≤ 1 for all i = 1, ..., m and j = 1, ..., n. Hence, the sum in (1.0.3) is maximized if Z = I_{m,n}, and the solution ˆQ to (1.0.2) is given by ˆQ = V I_{m,n} U^T. The OPP is well studied and is mentioned in introductory textbooks such as [12].
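The derivation above translates directly into a few lines of NumPy. The helper below is our own sketch of ˆQ = V I_{m,n} U^T, checked on noise-free data where the minimizer is known:

```python
import numpy as np

def solve_opp(X, B):
    """Analytical OPP solution (1.0.2): min ||QX - B||_F^2, Q^T Q = I_n,
    via the SVD U S V^T = X B^T, giving Qhat = V I_{m,n} U^T."""
    U, _, Vt = np.linalg.svd(X @ B.T)        # U: n x n, Vt: m x m
    n = X.shape[0]
    return Vt.T[:, :n] @ U.T                 # V I_{m,n} U^T, an m x n matrix

rng = np.random.default_rng(1)
m, n, p = 6, 3, 10
Q_true, _ = np.linalg.qr(rng.standard_normal((m, n)))
X = rng.standard_normal((n, p))
B = Q_true @ X                               # noise-free data
print(np.allclose(solve_opp(X, B), Q_true))  # True
```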


As a simple and small example of a WOPP, consider the following: find the minimum distance between the ellipse

    x = α_1 cos φ ,  y = α_2 sin φ

and the point B, according to Figure 1.

Figure 1: Y(Q) is here an ellipse with semi-major and semi-minor axes α_1 and α_2, respectively. The minimum distance occurs at Y(ˆQ), where the residual r = B − Y(ˆQ) is orthogonal to the tangent T of Y(Q) at ˆQ.

With Q = [cos φ, sin φ]^T, we can express the ellipse as the vector-valued function

    Y(Q) = [α_1 0; 0 α_2] [cos φ; sin φ] = AQ.

For vectors, the Frobenius norm is the same as the 2-norm. We can write the optimization problem as a WOPP (with X = 1):

    min_Q ||AQ − B||_2^2 , subject to Q^T Q = 1.                       (1.0.4)

Since n = 1 in this case, there are no orthogonality constraints. The task is to find the orthonormal matrix ˆQ (a normalized vector, Q^T Q = 1) that minimizes the distance between the ellipse Y(Q) and the point B. A solution to (1.0.4) can be computed by iterative methods such as Newton's method. For this simple case of a WOPP, a solution can also be computed by solving a fourth-degree polynomial; see Paper I. There can also be two different minimizers of (1.0.4); see Section 3.1.

Note that if α_1 = α_2 = χ, Y(Q) describes a circle with radius χ. Then, by taking X = χ, (1.0.4) can be written as an OPP

    min_Q ||QX − B||_2^2 , subject to Q^T Q = 1.                       (1.0.5)

The solution ˆQ to (1.0.5) is unique and is easily computed by taking ˆQ = B/||B||_2.
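To make the fourth-degree-polynomial remark concrete, here is a small sketch of our own construction (not the derivation of Paper I): substituting t = tan(φ/2) into the stationarity condition of (1.0.4) yields a quartic whose real roots enumerate all candidate angles.

```python
import numpy as np

def ellipse_stationary_angles(a1, a2, B):
    """All stationary angles of (1.0.4) for the ellipse
    (a1 cos(phi), a2 sin(phi)) and the point B = (b1, b2), via the
    quartic in t = tan(phi/2); returned sorted by distance to B."""
    b1, b2 = B
    d = a2**2 - a1**2
    # Stationarity: (a2^2 - a1^2) sin cos + a1 b1 sin - a2 b2 cos = 0;
    # clearing denominators after t = tan(phi/2) gives this quartic.
    coeffs = [a2 * b2, 2.0 * (a1 * b1 - d), 0.0, 2.0 * (a1 * b1 + d), -a2 * b2]
    ts = [t.real for t in np.roots(coeffs) if abs(t.imag) < 1e-9]
    phis = [2.0 * np.arctan(t) for t in ts] + [np.pi]  # t misses phi = pi
    dist = lambda p: np.hypot(a1 * np.cos(p) - b1, a2 * np.sin(p) - b2)
    return sorted(phis, key=dist)

# A flat ellipse with two minima (cf. Section 3.1); the first angle
# returned is the global minimizer, later ones include maxima/saddles.
print(ellipse_stationary_angles(5.0, 1.0, (0.5, 0.2)))
```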


Though these examples are very simple, they illustrate the difficulty of computing a solution to a WOPP compared to an OPP.

1.1 Procrustes problems

There are several types of optimization problems involving Stiefel manifolds. One class consists of the different types of Procrustes problems, arising in a wide area of applications. Commonly, the terms Procrustes analysis or Procrustes rotation are used instead of Procrustes problems; for example, orthogonal Procrustes analysis and weighted Procrustes rotation address the OPP and the WOPP, respectively.

The name Procrustes comes from Greek mythology. Procrustes (the Stretcher) was a robber and torturer who had an iron bed in which he desired to put his victims. To make them fit the bed, he cut off their limbs or, alternatively, stretched them out. In the end, karma came to Procrustes as he was fitted into his own bed by Theseus.

1.1.1 Rigid body movements

The ellipse problem discussed is, to say the least, very simple, due to its low dimension (m = 2 and n = 1). To give an example of a WOPP of larger dimension, we consider the problem of determining a rigid body movement.

Consider a rigid body with n landmarks x_1, ..., x_n in R^3 that is subject to a translation t ∈ R^3 and a rotation M ∈ R^{3×3}, taking the landmarks into the positions c_1, ..., c_n. In rigid body applications, M (for Motion) is commonly used to represent an orthogonal matrix.

Figure 2: A rigid body with three landmarks undergoing a rotation and translation.

The motion of the rigid body can be written as

    Mx_i + t = c_i ,

where M, with det(M) = 1, describes the rotations around the three axes. Given x_1, ..., x_n and c_1, ..., c_n, the rotation M can be computed by solving an orthogonal Procrustes problem (OPP) as follows [29]. Let X = [x_1 − ¯x, ..., x_n − ¯x] and C = [c_1 − ¯c, ..., c_n − ¯c], where ¯x and ¯c are the mean value vectors of the x_i and the c_i, i = 1, ..., n. Then M is given by solving

    min_M ||MX − C||_F^2 , subject to M^T M = I_3 , det(M) = 1.        (1.1.1)

The solution is given by using the SVD of XC^T, and if XC^T is nonsingular, the solution is unique.

Suppose now that the accuracies of the landmarks x_i and c_i, i = 1, ..., n, differ depending on the coordinate axes (in R^3). Let us say that along the third axis (z-axis) we have a noticeably lower accuracy than for the first and second axes (x- and y-axes). It is then preferable to give these z-axis coordinates a lesser impact when computing a solution. To do this, we can weight the OPP by using a weighting matrix A. For instance, let

    A = [1 0 0; 0 1 0; 0 0 α] ,

where 0 < α < 1 is a suitably chosen scalar. The weighted residual is then A(MX − C), and we get a WOPP on the form

    min_M ||A(MX − C)||_F^2 , subject to M^T M = I_3 , det(M) = 1.     (1.1.2)

Observe that by taking B = AC, (1.1.2) is on the form stated in Definition 1.0.1. Commonly, iterative optimization algorithms are used to compute a solution to (1.1.2). Moreover, a WOPP can have several minima, which leads to the problem of deciding whether a computed solution is the "best" one.

It can also be desirable to weight the OPP from the right. Assume that the accuracy when measuring some specific landmarks is worse than the average. Then, by constructing a diagonal matrix W, we can give these landmarks a low weight as (MX − C)W. The right-weighted OPP then becomes

    min_M ||(MX − C)W||_F^2 , subject to M^T M = I_3 , det(M) = 1.

By taking X := XW and B = CW, we see that solving this problem is done by solving an OPP.
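For concreteness, here is a minimal NumPy sketch of the SVD solution to (1.1.1), with the translation handled by centering and a possible reflection repaired by a sign flip. The routine name is ours; Paper I and [29] treat this problem in detail.

```python
import numpy as np

def rigid_fit(x, c):
    """Solve (1.1.1): rotation M (det M = 1) and translation t with
    M x_i + t ~ c_i.  x and c are 3-by-n arrays of landmarks."""
    xbar = x.mean(axis=1, keepdims=True)
    cbar = c.mean(axis=1, keepdims=True)
    X, C = x - xbar, c - cbar                # centered landmark matrices
    U, _, Vt = np.linalg.svd(C @ X.T)        # SVD of C X^T (= (X C^T)^T)
    s = np.sign(np.linalg.det(U @ Vt))       # flip one axis if a reflection
    M = U @ np.diag([1.0, 1.0, s]) @ Vt
    t = (cbar - M @ xbar).ravel()
    return M, t

# Recover a known motion from exact landmark data.
rng = np.random.default_rng(0)
M0, _ = np.linalg.qr(rng.standard_normal((3, 3)))
M0 *= np.linalg.det(M0)                      # force det(M0) = +1
x = rng.standard_normal((3, 6))
c = M0 @ x + np.array([[1.0], [2.0], [3.0]])
M, t = rigid_fit(x, c)
print(np.allclose(M, M0), np.allclose(t, [1.0, 2.0, 3.0]))  # True True
```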


1.1.2 Psychometrics

The OPP originates from factor analysis in psychometrics in the 1950s and 1960s, e.g., [16,18]. The task is to determine an orthogonal matrix Q ∈ R^{m×m} that rotates a factor (data) matrix A to fit some hypothesis matrix B. Typically in psychometrics, the points to be rotated are ordered row-wise in A, not column-wise in X as in the rigid body movement example shown above. We denote the rotation of A by Y(Q) = AQ. Hence, given A and B, we wish to find Q such that Y(Q) ≈ B.

When using the Euclidean distance to measure the distance between the rotation Y(Q) and B, the optimal orthogonal matrix Q is given by solving

    min_Q ||AQ − B||_F^2 , subject to Q^T Q = I_m.                     (1.1.3)

Since n = m, so that Q is an orthogonal matrix, (1.1.3) is an OPP with a solution that can be derived by using the SVD of B^T A.

A more common formulation of the OPP in psychometrics uses the matrix trace,

    min_Q tr((AQ − B)^T (AQ − B)) , subject to Q^T Q = I_m.

Extensions of (1.1.3) were considered later on. The case when Q is an orthonormal matrix, i.e., Q ∈ R^{m×n} with n ≤ m, was considered in, e.g., [5]. Given A and B, it is desired to find Q such that Y(Q) ≈ B, but now with Q^T Q = I_n. In [5], to measure the similarity of the two matrices Y and B, the degree of collinearity of the rows y_i^T and b_i^T of Y and B, respectively, is used. Hence, the solution ˆQ is computed from

    max_Q sum_i y_i^T b_i = max_Q tr(Y B^T) = max_Q tr(AQB^T) , subject to Q^T Q = I_n,    (1.1.4)

by using the SVD UΣV^T = B^T A, yielding ˆQ = V I_{m,n} U^T. This solution is computed in a similar manner as for an OPP. ˆQ is not necessarily the same as the solution obtained when the Euclidean distance is used to measure distances between points in Y and B. In that case, we instead get

    min_Q ||AQ − B||_F^2 , subject to Q^T Q = I_n.                     (1.1.5)

By using the matrix trace, we can write the objective function in (1.1.5) as

    ||AQ − B||_F^2 = tr(Q^T A^T AQ) − 2 tr(AQB^T) + tr(B^T B)
                   = tr(AQQ^T A^T) − 2 tr(AQB^T) + tr(B^T B).

If n = m, then QQ^T = I_m and (1.1.5) becomes an OPP whose solution is computed by maximizing tr(AQB^T), just as in (1.1.4). The differences occur when n < m, since then QQ^T ≠ I_m, and the term tr(AQQ^T A^T) becomes dependent on Q. To illustrate this, we can look at the example in [5]. There, the hypothetical factor matrices to be matched are

    A = [0.76 0.32 0.5; 0.5 0.5 −0.4; 0.52 −0.36 0.5; 0.5 −0.5 −0.4] ,
    B = [0.7 0.1; 0.8 0; 0.1 0.7; 0 0.8] .
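A quick NumPy check of the inner-product solution (1.1.4) for these matrices — a sketch of ours which, up to rounding, should reproduce the numbers quoted next:

```python
import numpy as np

A = np.array([[0.76, 0.32, 0.5], [0.5, 0.5, -0.4],
              [0.52, -0.36, 0.5], [0.5, -0.5, -0.4]])
B = np.array([[0.7, 0.1], [0.8, 0.0], [0.1, 0.7], [0.0, 0.8]])

# Cliff's solution (1.1.4): SVD of B^T A, then Qhat = V I_{m,n} U^T.
U, _, Vt = np.linalg.svd(B.T @ A)          # U is 2x2, Vt is 3x3
Qhat = Vt.T[:, :2] @ U.T                   # 3x2 orthonormal matrix
print(Qhat)
print(np.linalg.norm(A @ Qhat - B, 'fro')) # residual of (1.1.5) at Qhat
```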


The solution to (1.1.4) is

    ˆQ = [0.7444 0.6620; 0.6651 −0.7466; 0.0582 0.0657] ,
    Y(ˆQ) = [0.8077 0.2970; 0.6815 −0.0686; 0.1768 0.6459; 0.0164 0.6780] ,

while the solution to (1.1.5) is

    ¯Q = [0.7385 0.6570; 0.6656 −0.7462; −0.1073 −0.1076] ,
    Y(¯Q) = [0.7206 0.2067; 0.7450 −0.0016; 0.0907 0.5565; 0.0794 0.7446] .

The discrepancies here, for the two different solutions ˆQ and ¯Q, are ||ˆQ − ¯Q||_F = 0.2398, ||Y(ˆQ) − B||_F = 0.3052 and ||Y(¯Q) − B||_F = 0.2119.

It can also be desirable to weight either the rows or the columns of the residual AQ − B, as described above in the rigid body movement example. Weighting the rows of AQ − B gives

    min_Q tr((AQ − B)^T ˜W^2 (AQ − B)) , subject to Q^T Q = I_n,       (1.1.6)

and weighting the columns of AQ − B gives

    min_Q tr((AQ − B) ¯W^2 (AQ − B)^T) , subject to Q^T Q = I_n,       (1.1.7)

where ˜W and ¯W are known diagonal weighting matrices [20,22]. If n = m, then (1.1.6) becomes an OPP, in a similar way to the right-weighted OPP in Section 1.1.1. Equation (1.1.7) can be written as a WOPP according to Definition 1.0.1 by taking B := B¯W and X = ¯W.

1.1.3 The OPP and WOPP

Areas where the WOPP (and OPP) arise are applications related to, e.g., rigid body movement and psychometrics as mentioned, factor analysis [15,23], multivariate analysis and multidimensional scaling [6,13], and the global positioning system [2]. Typically, it is a matter of computing a matrix with orthonormal columns when it is desired to match one set of data to another.

As mentioned earlier, the solution to a WOPP cannot be computed as easily as for an OPP. Additionally, a WOPP can have several local minima. Hence a solution computed by some iterative method is not necessarily a global optimum. The formulation (1.0.1) has sometimes been referred to as the Penrose regression problem.¹

¹ It is not clear from where this term originates. It seems as if the first time the term "Penrose regression" was used was in an older version (technical report) of [4] from 1997. Lars Eldén informed me that Penrose studied the best approximation of the matrix equation AXC = B, where A, X, C and B are any general matrices [26].


1.2 Linear least squares problems on the Stiefel manifold

In this thesis, we also consider optimization problems on the form

    min_Q ||f(Q) − b||_2^2 , subject to Q ∈ V_{m,n},                   (1.2.1)

where f(Q) ∈ R^k is a vector-valued function of Q and b ∈ R^k is a known vector. We restrict ourselves to the cases when f(Q) is linear in Q.

Another way of writing (1.2.1) is

    min_Q ||f(Q) − b||_2^2                                             (1.2.2)

    subject to  q_i^T q_j = 0 if i ≠ j, and q_i^T q_j = 1 otherwise.   (1.2.3)

There are two types of constraints for this problem: the orthogonality constraint that q_i ⊥ q_j whenever i ≠ j, and the normalizing constraint ||q_i|| = 1 for all i.

The algorithms presented in Paper I and Paper II are based on the formulation (1.2.1). They can be used to compute a solution to a WOPP. To write a WOPP on the form given in (1.2.1), we make use of the Kronecker product ⊗ and the vec-operator. The (i, j) block of the Kronecker product X ⊗ A of two matrices X and A is x_{i,j} A. vec(Q) is a stacking of the columns of Q = [q_1, ..., q_n] ∈ R^{m×n} into a vector,

    vec(Q) = [q_1; q_2; ...; q_n] ∈ R^{mn}.

Let Y(Q) = AQX. Then, by using the vec-operator on Y and B, we get

    f(Q) = vec(Y) = [X^T ⊗ A] vec(Q) ,
    b = vec(B).

The function Y(Q) ∈ R^{m×n} is now embedded in R^{mn}. Generally, any matrix function Y(Q) that is linear in Q can be expressed as a vector-valued function f(Q) = vec(Y(Q)) = F vec(Q), where F ∈ R^{k×mn}.
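The identity vec(AQX) = [X^T ⊗ A] vec(Q) is easy to verify numerically; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 4, 3, 5
A = rng.standard_normal((m, m))
X = rng.standard_normal((n, p))
Q, _ = np.linalg.qr(rng.standard_normal((m, n)))  # a point on V_{m,n}

vec = lambda Y: Y.flatten(order='F')   # column-wise stacking
F = np.kron(X.T, A)                    # F in R^{(mp) x (mn)}
print(np.allclose(F @ vec(Q), vec(A @ Q @ X)))  # True
```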




Chapter 2
Algorithms for least squares problems on the Stiefel manifold

2.1 The 3-dimensional WOPP, Paper I

In Paper I, we present an algorithm to solve the 3-dimensional WOPP (1.1.2). As a parametrization of M, the Cayley transform C(S) of a skew-symmetric matrix S = −S^T ∈ R^{3×3} is used,

    C(S) = (I + S)(I − S)^{−1}.

The algorithm uses Newton or Gauss-Newton search directions, and due to the geometry of the problem, optimal step lengths can be computed very simply. The weighted case of (1.1.1) has also been specially studied by others [1].

A poster of this work was presented at the First SIAM-EMS Conference "AMCW" 2001, Berlin, September 2-6, 2001.

2.2 Linear least squares problems on the Stiefel manifold, Paper II

In Paper II, we consider the least squares problem

    min_Q (1/2)||f(Q) − b||_2^2 , subject to Q ∈ V_{m,n},              (2.2.1)

where f(Q) ∈ R^k can be written as f(Q) = F vec(Q) with F ∈ R^{k×mn} and rank(F) = min(k, mn). There are some requirements on the matrix F, though. Suppose Q is parameterized with p parameters; then if k < p, the optimization problem is under-determined, and in fact the Jacobian of f(Q) will not have full column rank.


Even if k ≥ p, it is not guaranteed that (2.2.1) is not under-determined. We illustrate this with a small example. Take Q = [q_1, q_2, q_3] ∈ R^{3×3}; then p = 3 parameters are needed to represent Q. But with F = [I_3, Z, Z] ∈ R^{3×9}, where Z ∈ R^{3×3} is a zero matrix, we see that f(Q) is independent of q_2 and q_3, due to multiplication with zeros, i.e.,

    f(Q) = [I_3 Z Z] [q_1; q_2; q_3] = q_1 + Zq_2 + Zq_3 = q_1.

Hence it is necessary that F corresponds to a sufficient amount of data, such that (2.2.1) is well-posed. The algorithm in Paper II can to some extent be used to solve some under-determined problems, but it is not developed to handle them. The best way to deal with an under-determined (rank-deficient, ill-posed) problem is to make a reformulation, if possible. For the example given, it would be better to reformulate the problem with Q ∈ R^{3×1} and F = diag(1, 1, 1) ∈ R^{3×3}.

The algorithm in Paper II started out as a generalization of the algorithm in Paper I. Hence the Cayley transform was used to parameterize Q ∈ R^{m×n}. Since the Cayley transform only works for orthogonal matrices, i.e., when m = n, a slight modification of it was needed to comply with the unbalanced cases when n < m. This algorithm also uses Newton or Gauss-Newton methods to get a descent direction. Optimal step lengths could not be computed as easily as in Paper I. However, it was later found that by using the matrix exponential of a skew-symmetric matrix S, exp(S), instead of the Cayley transform to parameterize Q ∈ R^{m×n}, optimal step lengths could be computed rather simply. The choice of exp(S) results in a similar algorithm as when using the Cayley transform.

Parts of this work were presented at the 18th International Symposium on Mathematical Programming (ISMP), Copenhagen, August 18-22, 2003.
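Both parametrizations are easy to sanity-check numerically. A minimal sketch of ours (not the Paper I/II code): the Cayley transform of a skew-symmetric S, in the layout used in Paper I, is orthogonal with determinant 1, and multiplication by exp(S) preserves membership of V_{m,n}.

```python
import numpy as np
from scipy.linalg import expm

def cayley(S):
    """C(S) = (I + S)(I - S)^{-1}; I - S is nonsingular for skew-symmetric S."""
    I = np.eye(S.shape[0])
    return (I + S) @ np.linalg.inv(I - S)

s1, s2, s3 = 0.3, -0.1, 0.2
S = np.array([[0.0, -s1, -s2],      # the layout S(s) used in Paper I
              [s1, 0.0, -s3],
              [s2, s3, 0.0]])
M = cayley(S)
print(np.allclose(M.T @ M, np.eye(3)), np.isclose(np.linalg.det(M), 1.0))

# exp(S) is orthogonal as well, so Q -> exp(S) Q stays on V_{m,n}:
rng = np.random.default_rng(0)
Q0, _ = np.linalg.qr(rng.standard_normal((5, 2)))
W = rng.standard_normal((5, 5))
Q1 = expm(W - W.T) @ Q0
print(np.allclose(Q1.T @ Q1, np.eye(2)))   # True
```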


Chapter 3
Global minimization of a WOPP

As mentioned, a WOPP can have several minima. The task of finding the "best" minimizer is a global optimization problem. In order to know whether a computed minimizer is a global minimizer, a sufficient condition for a global optimum is desired; that is, a condition that is true for a global minimizer but false for any local minimum. Deriving such a condition is not an easy task. In [9], a necessary condition for a global optimum is presented. If this condition fails for a computed solution ˆQ, then ˆQ is a local minimum. If the necessary condition holds for ˆQ, then ˆQ can be either a local or a global minimum.

Another way to determine whether a minimizer ˆQ is a global optimum is to compute all minima of the problem and then check which of those minima yields the least objective function value. In order to do so, it is useful to know how many minima the problem might have. The studies done in Paper III and Paper IV, presented below, lead to the following conjecture.

Conjecture 3.0.1 The weighted orthogonal Procrustes problem

    min_Q ||AQX − B||_F^2 , subject to Q^T Q = I_n,

has at most 2^n unconnected minima.

By unconnected minima, we mean that the minima are distinct. For some special cases, there can be a continuum of minimizers. This can be illustrated by considering the ellipse example shown earlier: let A = I_2 and let B = 0 (the origin); then any Q ∈ V_{2,1} is a minimum.


3.1 The number of minima to a WOPP, Paper III

Paper III contains a study on the number of minima to a WOPP. As a simple example of a WOPP with more than one minimizer, consider the ellipse problem mentioned earlier. As seen in Figure 1, a local minimum occurs in the fourth quadrant. What determines whether the optimization problem has one or two minima is the flatness of the ellipse, along with the magnitude and direction of b. For a very flat ellipse, where α_1 >> α_2, it is more likely that a local minimum can exist, as opposed to an "almost circular" ellipse with α_1 ≈ α_2. If α_1 = α_2, then f(Q) is a circle and only one minimizer ˆQ = b/||b|| exists if b ≠ 0.

Figure 1: Two minima are found where the tangent of f(Q) is orthogonal to the distance vector from b to the ellipse (the residual r = b − f(Q)). The minimum in the first quadrant is global.

Consider now the case when b = 0. If α_1 = α_2, any Q ∈ V_{2,1} is a minimizer, i.e., a continuum of solutions arises. Roughly speaking, we consider a continuum of solutions as one minimizer. However, if α_1 > α_2, the WOPP always has two distinct, unconnected minima at ˆQ = [0, ±1]^T. This reasoning applies to any WOPP with Q ∈ R^{m×1}, which we call the ellipsoid cases, since the surface of f(Q) = AQX is a hyper-ellipsoid in R^m. In this instance there can be at most two unconnected minima to the WOPP for any b ∈ R^m; see Paper III.

The global minimum ˆQ of an ellipsoid case must fulfill the condition

    sign(f_i(ˆQ)) = sign(b_i)  for all i = 1, ..., m.

That is, f(ˆQ) must lie in the same region as b among the regions into which the coordinate planes divide the space. For example, for Q ∈ R^2, f(ˆQ) and b must lie in the same quadrant; for Q ∈ R^3 they must be in the same octant, and so on.


For a WOPP of general dimension, the geometry becomes more complicated than in the ellipsoid cases. Nevertheless, the surface of f(Q) still has elliptic properties. For instance, we can move from one point f(˜Q) to another point f(¯Q) by following ellipses on the surface of f(Q). When studying the maximum number of minima to a WOPP, using b = 0 (B = 0) is a natural first approach. It is also preferable to assume that α_i > α_{i+1} and χ_j > χ_{j+1} for all i = 1, 2, ..., m − 1 and j = 1, 2, ..., n − 1. Compare this to the task of determining the number of eigenvectors of a matrix E ∈ R^{m×m}. If we consider, e.g., the identity matrix E = I ∈ R^{m×m}, any x ∈ R^m with ||x|| = 1 is an eigenvector. But if E is a diagonal matrix E = diag(ε_1, ..., ε_m), where ε_i ≠ ε_j for all i ≠ j, then the number of eigenvectors of E is finite.

However, in Paper III it is shown that a WOPP with Q ∈ R^{m×n} has 2^n minima when B = 0. Empirical studies indicate that this is a valid upper bound for the maximal number of minima to a WOPP.

3.2 Computing all minimizers, Paper IV

Consider a very flat ellipse with α_1 >> α_2 according to Figure 2, with ˆb lying in the first quadrant. The solution ˆQ is then also in the first quadrant. Now let b = ˆb + δb be a perturbed measurement, as shown. Due to the flatness of the ellipse, the global minimum is now in the fourth quadrant. In this case, the local minimum is a better approximation of the "correct" solution ˆQ than the global minimum. This can also occur for higher dimensional problems. Hence computing all, or some, solutions to a WOPP can be preferable.

Figure 2: A flat ellipse with the unperturbed point ˆb and the perturbed point b = ˆb + δb.

In Paper IV, an algorithm to compute all minima to a WOPP is presented. To explain how this algorithm works, we again take the ellipsoid cases. Assume that the local minimum ˆQ_2 in Figure 1 has been computed. To get a ˜Q that is in the vicinity of the global minimum ˆQ, we can use the normal N of f(Q) at Q = ˆQ_2. The residual r = b − f(ˆQ_2) coincides with the normal direction. Consider the function f(ˆQ_2) + Nγ, where γ is a scalar and N is the normal at f(ˆQ_2). Computing the intersection of f(ˆQ_2) + Nγ and the ellipse yields a Q_2 in the vicinity of ˆQ. Now Q_2 is a good initial value for an iterative method to compute ˆQ.
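For the 2-dimensional ellipsoid case, this normal-line step reduces to a single root computation, since the known intersection γ = 0 can be divided out of the quadratic. A sketch of the idea (ours, not the Paper IV implementation):

```python
import numpy as np

def other_intersection(a1, a2, phi2, b):
    """Given a stationary angle phi2 on the ellipse (a1 cos, a2 sin),
    intersect the normal line f(phi2) + gamma*(b - f(phi2)) with the
    ellipse and return the angle of the second intersection point."""
    p = np.array([a1 * np.cos(phi2), a2 * np.sin(phi2)])   # f(Q2hat)
    r = b - p                                              # residual = normal
    # (p_x + g r_x)^2/a1^2 + (p_y + g r_y)^2/a2^2 = 1 is quadratic in g;
    # the constant term vanishes because p lies on the ellipse, so the
    # roots are g = 0 (the known point) and g = -B/A.
    A = (r[0] / a1) ** 2 + (r[1] / a2) ** 2
    B = 2.0 * (p[0] * r[0] / a1 ** 2 + p[1] * r[1] / a2 ** 2)
    q = p + (-B / A) * r                                   # second intersection
    return np.arctan2(q[1] / a2, q[0] / a1)                # starting angle
```

The returned angle serves as the initial value for Newton's method toward the other minimizer.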


This method is roughly the same for any ellipsoid case, since the normal is uniquely defined.

Figure 3: The normal plane N (dashed line) at f(ˆQ_2) intersects the surface of f(Q) at f(Q_2), in the vicinity of f(ˆQ).

From special studies of Q ∈ R^{2×2} and Q ∈ R^{3×2}, it seems as if this normal plane method is a good method for computing all minimizers. Since the surface of f(Q) is "built up by ellipses", intuitively this method should be viable for a WOPP of general dimension. In these cases, computing the intersections of the normal plane and the surface of f(Q) is done by computing all solutions to a set of quadratic equations. These equations can be formulated as a continuous algebraic Riccati equation (CARE), a well-known quadratic matrix equation [21,27]. This equation has 2^n roots, the same number as the estimated maximal number of minima to a WOPP. Empirical studies show that this method manages to compute all, or at least several, minima to the WOPPs considered.

A presentation of this work was held at the 18th International Symposium on Mathematical Programming (ISMP), Copenhagen, August 18-22, 2003.
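For reference, SciPy can compute one root of a CARE — the stabilizing one — with solve_continuous_are; note that Paper IV needs all 2^n roots, which this routine does not provide. A minimal example of the equation A^T X + XA − XBR^{−1}B^T X + Q = 0 being satisfied:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# A small CARE in SciPy's convention (the data here is illustrative).
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

X = solve_continuous_are(A, B, Q, R)   # the stabilizing root only
res = A.T @ X + X @ A - X @ B @ np.linalg.inv(R) @ B.T @ X + Q
print(np.linalg.norm(res))             # close to machine precision
```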


Chapter 4
Quadratic equations

In Paper V, an iteration method for solving F(x) = 0, where F(x) ∈ R^m and x ∈ R^m, is presented. It exhibits cubic convergence by using second order information (derivatives) in each step. Other methods using second order information are the Chebyshev method [19] and Halley's method [25]. Implementing these in R^m for m > 1 from a "practical linear algebraic" point of view does not seem easy. As an example, the Chebyshev method (in Banach spaces) is often presented as in [19],

    x_{k+1} = x_k − (I + (1/2) F'(x_k)^{−1} F''(x_k) F'(x_k)^{−1} F(x_k)) F'(x_k)^{−1} F(x_k).

How is the second order derivative F''(x) (usually represented as a tensor) multiplied with the inverse of the Jacobian F'(x_k)^{−1}? Presentations of these methods seem too abstract. It is suspected that mathematicians are not interested in how the multiplication is performed; they are satisfied with the knowledge that it is possible. In [7], a straightforward and practical presentation of Halley's method in R^m is given. Paper V also contains the writer's interpretation of Halley's method in several variables.

However, using higher order information is usually computationally heavy. For a quadratic problem, this is not always the case: the second order derivatives of quadratic equations are constant, hence the computational cost does not grow that much. A small experimental study of this is done in Paper V, indicating an increased efficiency for quadratic problems with fewer than 15 parameters.

Finally, to prove cubic convergence for the method presented in Paper V when m > 1, a rather cumbersome and messy tensor arithmetic was invented and used. The paper [31] contains a short informal note about this tensor arithmetic.
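One concrete reading of the formula (our own sketch, not the method of Paper V): with u = F'(x_k)^{−1} F(x_k), the troublesome product collapses to the tensor contraction F''[u, u], which for a quadratic system has a constant tensor T with T[i, j, k] = ∂²F_i/∂x_j∂x_k.

```python
import numpy as np

# Chebyshev's method for a quadratic system F(x) = 0 in R^2, with the
# second derivative handled as an explicit constant tensor T.

def F(x):
    return np.array([x[0]**2 + x[1] - 3.0, x[0] + x[1]**2 - 5.0])

def J(x):
    return np.array([[2.0 * x[0], 1.0], [1.0, 2.0 * x[1]]])

T = np.zeros((2, 2, 2))
T[0, 0, 0] = 2.0          # d^2 F_1 / dx_1^2
T[1, 1, 1] = 2.0          # d^2 F_2 / dx_2^2

x = np.array([2.0, 3.0])  # the root is (1, 2)
for _ in range(5):
    u = np.linalg.solve(J(x), F(x))             # Newton correction
    w = np.einsum('ijk,j,k->i', T, u, u)        # the contraction F''[u, u]
    x = x - u - 0.5 * np.linalg.solve(J(x), w)  # Chebyshev step
    print(x, np.linalg.norm(F(x)))              # residual drops cubically
```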




Chapter 5
Ill-posed problems

An inverse problem is the task of, e.g., determining some parameters x in a model (function) f(x) by using some observed data b, where b ≈ f(x). Typically, we can write a solution ˆx as ˆx = f^{−1}(b) if the problem is well-posed. The definition of well-posedness was set up by Hadamard in the beginning of the 20th century as [10,17]:

a) For all admissible data, a solution exists.
b) For all admissible data, the solution is unique.
c) The solution depends continuously on the data.

If any of the above properties does not hold, the problem is said to be ill-posed. Connected to c) are ill-conditioned problems. A problem is called ill-conditioned if a small perturbation in b yields a large perturbation of the solution ˆx.

Consider a nonlinear optimization problem such as

    min_x ||f(x) − b||_2^2 ,                                           (5.0.1)

where f(x) ∈ R^m is a nonlinear function of the parameters x ∈ R^n to be determined, and b ∈ R^m corresponds to some input data (measurements). Here, (5.0.1) is ill-conditioned in the sense that a small perturbation Δb in b may result in a large perturbation of the solution ˆx of (5.0.1).

A commonly used approach for solving these types of optimization problems is Tikhonov regularization,

    min_x ||f(x) − b||_2^2 + λ||L(x − x_c)||_2^2 ,                     (5.0.2)

where λ > 0 is called the regularization parameter, x_c the center of regularization, and L a known weighting matrix. For simplicity, consider L as the identity matrix. Then the regularized problem (5.0.2) corresponds to determining a ˆλ and a solution ˆx(ˆλ) such that ||ˆx − x_c||_2 is not large and ||f(ˆx) − b||_2 is somewhat optimal. How should this be done? Given some x_c and a priori knowledge about the magnitude of the noise level, e.g., ||Δb||_2 ≤ δ_b, an idea would be to formulate (5.0.2) as

    min_x ||x − x_c||_2^2 , subject to ||f(x) − b||_2^2 = δ_b.         (5.0.3)

This is known as the discrepancy principle. We could also assume that, for a given x_c, a solution ˆx should fulfill ||ˆx − x_c||_2 ≤ δ_x, where δ_x is known. Then a solution could be computed by solving

    min_x ||f(x) − b||_2^2 , subject to ||x − x_c||_2^2 ≤ δ_x.         (5.0.4)

5.1 The L-curve for nonlinear problems, Paper VI

How should (5.0.1) or (5.0.2) be solved without any a priori knowledge? A quite popular method for linear least squares problems is the L-curve method [17]. The L-curve is given by plotting the solution norm ||x(λ) − x_c|| as a function of the residual ||f(x(λ)) − b||, in a log-log scale, for different values of λ. The idea is to pick a solution "in the corner" of the L-curve, since it is there that the solution norm starts to grow drastically.

In Paper VI, a rather specialized investigation of the L-curve method in connection with nonlinear problems is done. Nonlinear problems can be extremely different from each other, as opposed to linear problems. Hence the L-curves often vary in shape, and similarities to the L-curve for linear problems can be uncommon.

Additionally, the center of regularization x_c plays a bigger role for nonlinear problems. Without any additional information, it can be hard to motivate that a solution ˆx is better than x_c itself just because ˆx yields a smaller residual norm.
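A hedged sketch of how the points of such an L-curve can be generated for a nonlinear problem: solve (5.0.2) with L = I for a sweep of λ by stacking √λ(x − x_c) onto the residual. The model f below is a toy example of our own, not one from Paper VI.

```python
import numpy as np
from scipy.optimize import least_squares

def f(x):
    """A small nonlinear toy model (illustrative only)."""
    return np.array([np.exp(-x[0]), np.exp(-2.0 * x[0]) + x[1]**2, x[0] * x[1]])

b = f(np.array([1.0, 0.5])) + 1e-3 * np.array([1.0, -1.0, 1.0])  # noisy data
xc = np.zeros(2)                                                 # center

for lam in np.logspace(-6, 1, 8):
    # Tikhonov (5.0.2) as an augmented least squares residual.
    res = lambda x: np.concatenate([f(x) - b, np.sqrt(lam) * (x - xc)])
    sol = least_squares(res, x0=xc).x
    # One (residual norm, solution norm) point of the L-curve per lambda:
    print(lam, np.linalg.norm(f(sol) - b), np.linalg.norm(sol - xc))
```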


Chapter 6
Software

6.1 WOPP software

The algorithms presented in Paper I, Paper II and Paper IV have been implemented in MATLAB. Though the algorithm in Paper II also manages the 3-dimensional case described in Paper I, a special routine for the algorithm in Paper I is available. Also included are routines using the Cayley transform parametrization instead of the matrix exponential function.

Details regarding the software can be found at
http://www.cs.umu.se/~viklands/WOPP/index.html.

6.2 L-curve toolbox

Initially, our goal was to develop a toolbox for Tikhonov regularization of nonlinear optimization problems, in a similar manner as Hansen [17] did for linear problems. Due to the difficulties that arise with nonlinearity (convergence aspects, how to choose and update regularization parameters, time-consuming computations, etc.), this turned out to be a difficult task. The black-box type algorithm presented in Paper VI can, and probably will, fail if the nonlinear function f(x) is replaced by some other "general" function (coming from a different application). Therefore, no toolbox software has been made public yet.




Chapter 7
Research biography and reflections

My first years of research (1999-2001) were focused on algorithms for solving ill-posed nonlinear optimization problems by using Tikhonov regularization (5.0.2) and the L-curve. Since no specific, real application was considered, this gave rise to some problems. We mostly regarded f(x) ∈ R^m as an ill-conditioned function of the parameters x ∈ R^n, and the weighting matrix L as the identity matrix, i.e.,

    min_x ||f(x) − b||_2^2 + λ||x − x_c||_2^2 .                        (7.0.1)

Here we get something like a parameter estimation problem, which I now think is a bad approach. I figure it would be better to consider problems (applications) where x is a function instead, e.g., x(t) where t ∈ R, and then apply a smoothness condition on x(t) by using a weighting matrix L.

Also, how should the center of regularization x_c be treated? Let us consider two choices of how to treat x_c:

1. x_c is a fixed (known) point that should not be changed, e.g., a priori information.

2. x_c is just an arbitrary point (initial value) "in the vicinity" of the solution. Roughly speaking, it is not that important.

Our approach was to consider x_c as in item 2, and then, as I see it, a small dilemma turns up. Using the L-curve method, suppose we compute a solution ˆx with some not-so-important x_c. Is ˆx really a better solution than x_c itself, just because it gives a smaller residual norm? It seems hard to state that, unless some additional information is provided.

If we now state that ˆx is a good solution to the problem, then by taking x_c = ˆx and solving the problem again with the L-curve, we get a new solution ¯x. The new solution ¯x is not necessarily the same as the first computed solution ˆx. This does not feel right. In my opinion, having computed a solution ˆx by some method with an arbitrary x_c, and then using x_c = ˆx and solving the problem again, should result in the computed solution still being ˆx (if we disregard problems arising due to finite arithmetic). This is where the L-curve method fails, but the discrepancy principle does not.

However, I do not think the L-curve (or other heuristic methods) for solving nonlinear ill-posed problems is a waste, but I do believe a specific real application is needed. The test problems considered, e.g., NMR spectroscopy and heat transfer equations, were purely artificial. Doing research with a "general" nonlinear function f(x), i.e., having a black-box approach, seemed rather hopeless at that time. Even though the area of ill-posed problems is larger and more popular than the area of WOPPs, I think it was a good decision that I left it. In my opinion, research on deriving good or reasonable solutions to an ill-posed problem is more connected to statistics than to developing optimization algorithms.

The first time I came in contact with a WOPP was during a course in optimization related to rigid body movement, held by my supervisor Per-Åke Wedin in the year 2000. Having only some basic knowledge about the common OPP, I became interested in the fact that a WOPP can have several minima, while an OPP has a unique minimizer (if BX^T is nonsingular). Slowly, my research became less focused on ill-posed problems and instead turned towards the WOPP.

When we developed the algorithms described in Paper I and Paper II, I made a lot of numerical tests. It was noted that there were some patterns among the different minima to a WOPP. For instance, if ˆQ_1 and ˆQ_2 are two minima, they can look quite similar apart from some + or − signs on different elements. By studying some special and low-dimensional cases, I found out that using the normal plane to compute additional minima seemed a good method. The result was the normal plane algorithm presented in Paper IV.

Before the connection to the CARE was discovered in Paper IV, a heuristic method was used to compute all solutions to the quadratic equations (computation of the normal plane intersections). It was noted early on that there always seemed to be 2^n solutions. This was later seen to be the exact number of solutions when the CARE formulation was used. However, to compute all 2^n solutions, Newton's method with random initial values was used until all solutions were found. To speed this up, a new iteration method was considered in Paper V. In connection to this work, I became interested in tensor representations of higher order derivatives. But it felt as if I was too far away from what I was supposed to be working with.
Hence only a small note about the subject was written [31], and I turned back to the WOPP.

By empirical studies, I noted that the maximal number of minima to a WOPP seemed to be given by the formula 2^n. Also, if B was small, more minima were likely to occur. This resulted in a special study of the case B = 0 in Paper III. Initially, when studying the WOPP with B = 0, the Cayley transform parametrization was used. This resulted in rather long and badly arranged equations. In [9], Eldén and Park use the Lagrangian formulation of the WOPP, which inspired me. By using that formulation, the equations become more foreseeable, but the proofs in Paper III can presumably be done differently and more simply.

The WOPP is not as "well used" an optimization problem, so to speak, as the OPP. Very little has been published about its real-world applications. Is it because the OPP is very easy to solve, so that people do not want to complicate the problem by turning it into a WOPP, even though a weighting or such is desirable? I do not know. However, I hope that some of this work can be useful for present and future persons wishing to solve and do research on the WOPP and on least squares problems defined on Stiefel manifolds.


References

[1] P. G. Batchelor and J. M. Fitzpatrick. A Study of the Anisotropically Weighted Procrustes Problem. IEEE Workshop on Mathematical Methods in Biomedical Image Analysis (MMBIA'00), page 212, 2000.
[2] T. Bell. Global Positioning System-Based Attitude Determination and the Orthogonal Procrustes Problem. Journal of Guidance, Control, and Dynamics, 26(5):820-822, 2003.
[3] M. T. Chu and N. T. Trendafilov. On a Differential Equation Approach to the Weighted Orthogonal Procrustes Problem. Statistics and Computing, 8(2):125-133, 1998.
[4] M. T. Chu and N. T. Trendafilov. The Orthogonally Constrained Regression Revisited. J. Comput. Graph. Stat., 10:746-771, 2001.
[5] N. Cliff. Orthogonal rotation to congruence. Psychometrika, 31(1):33-42, 1966.
[6] T. F. Cox and M. A. A. Cox. Multidimensional Scaling. Chapman & Hall, 1994.
[7] A. A. M. Cuyt and L. B. Rall. Computational Implementation of the Multivariate Halley Method for Solving Nonlinear Systems of Equations. ACM Transactions on Mathematical Software, 11(1):20-36, 1985.
[8] A. Edelman, T. A. Arias, and S. T. Smith. The Geometry of Algorithms with Orthogonality Constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303-353, 1998.
[9] L. Eldén and H. Park. A Procrustes problem on the Stiefel manifold. Numer. Math., 82(4):599-619, 1999.
[10] H. W. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Kluwer Academic Publishers, 1996.
[11] W. Gander. Least Squares with a Quadratic Constraint. Numer. Math., 36:291-307, 1981.
[12] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1989.
[13] J. C. Gower. Multivariate Analysis: Ordination, Multidimensional Scaling and Allied Topics. Handbook of Applicable Mathematics, VI: Statistics (B), 1984.
[14] J. C. Gower. Orthogonal and projection Procrustes analysis. In W. J. Krzanowski, editor, Recent Advances in Descriptive Multivariate Analysis. Oxford University Press, Oxford, 1995.
[15] J. C. Gower and G. B. Dijksterhuis. Procrustes Problems. Oxford University Press, 2004.
[16] B. F. Green. The orthogonal approximation of an oblique simple structure in factor analysis. Psychometrika, 17:429-440, 1952.
[17] P. C. Hansen. Rank-Deficient and Discrete Ill-Posed Problems. SIAM, 1998.
[18] J. R. Hurley and R. B. Cattell. The Procrustes program: Producing direct rotation to test a hypothesized factor structure. Behavioural Science, 6:258-262, 1962.
[19] J. M. Gutiérrez and M. A. Hernández. New Recurrence Relations for Chebyshev method. Appl. Math. Lett., 10(2):63-65, 1997.
[20] M. A. Koschat and D. F. Swayne. A Weighted Procrustes Criterion. Psychometrika, 56(2):229-239, 1991.
[21] P. Lancaster and L. Rodman. The Algebraic Riccati Equation. Oxford University Press, 1995.
[22] R. W. Lissitz, P. H. Schönemann, and J. C. Lingoes. A solution to the weighted Procrustes problem in which the transformation is in agreement with the loss function. Psychometrika, 41(4):547-550, 1976.
[23] W. Meredith. On weighted Procrustes and hyperplane fitting in factor analytic rotation. Psychometrika, 42(4):491-522, 1977.
[24] A. Mooijaart and J. J. F. Commandeur. A General Solution of the Weighted Orthonormal Procrustes Problem. Psychometrika, 55(4):657-663, 1990.
[25] J. M. Ortega and W. C. Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, 1970.
[26] R. Penrose. A Generalized Inverse for Matrices. Proc. Cambridge Philos. Soc., 51:406-413, 1955.
[27] J. E. Potter. Matrix Quadratic Solutions. J. SIAM Appl. Math., 14(3):496-501, 1966.
[28] T. Rapcsák. On Minimization on Stiefel Manifolds. European J. Oper. Res., 143(2):365-376, 2002.
[29] I. Söderkvist and P.-Å. Wedin. On Condition Numbers and Algorithms for Determining a Rigid Body Movement. BIT, 34:424-436, 1994.
[30] E. Stiefel. Richtungsfelder und Fernparallelismus in n-dimensionalen Mannigfaltigkeiten. Commentarii Math. Helvetici, 8:305-353, 1935-1936.
[31] T. Viklands. A Note on Representations of Derivative Tensors of Vector Valued Functions. Technical Report UMINF-05.14, Department of Computing Science, Umeå University, Umeå, Sweden, 2005.


Paper I

Algorithms for 3-dimensional Weighted Orthogonal Procrustes Problems*

Per-Åke Wedin and Thomas Viklands†
Department of Computing Science, Umeå University
SE-901 87 Umeå, Sweden.
pwedin@cs.umu.se, viklands@cs.umu.se

Abstract

A weighted orthogonal Procrustes problem can be written as the minimization of ||AMX − B||_F^2, subject to M^T M = I, where A, X and B are known matrices and M is to be determined. In this paper, we consider the special case when M ∈ R^{3×3} with det(M) = 1. This is typically the case in applications related to rigid body movement. Our approach is to formulate the problem as min ||f(M) − b||_2^2 subject to M^T M = I, where f(M) is a linear function of M and b a vector. The Cayley transform is used to make a local parametrization of M. The iterative Newton-based algorithm presented finds the closest point on an ellipse on the surface of f(M) in each step.

Keywords: Weighted, orthogonal, Procrustes, ellipses, rigid body movement, Cayley transform.

* From UMINF-06.06, 2006.
† Financial support has partly been provided by the Swedish Foundation for Strategic Research under the frame program grant A3 02:128.


Contents

1 Introduction
2 Choice of coordinate system
3 Moving on the surface of f(M)
  3.1 Computing the step length
    3.1.1 The 2 dimensional ellipse problem
4 WOPP algorithm for M ∈ R^{3×3}
5 Computational experiments
  5.1 Tests with a relative type of perturbation
  5.2 Tests with an isotropic type of perturbation
  5.3 Summary of computational results
A The canonical form of a WOPP
B The solution to an OPP
C Tables for the tests with a relative type of perturbation
D Tables for the tests with an isotropic type of perturbation
E Finding a condition for a global minimizer
References


1 Introduction

Consider a body with n landmarks x_1, ..., x_n in R^3 that is subject to a translation and rotation, taking the landmarks into the positions b_1, ..., b_n. This can be written as

    Mx_i + t = b_i ,

where M ∈ R^{3×3} is an orthogonal matrix with det(M) = 1, describing rotations around the three axes, and t ∈ R^3 is a translation. Given x_1, ..., x_n and b_1, ..., b_n, the rotation matrix M can be computed by solving an orthogonal Procrustes problem (OPP) [7]. Let X = [x_1 − ¯x, ..., x_n − ¯x] and B = [b_1 − ¯b, ..., b_n − ¯b], where ¯x = (x_1 + ... + x_n)/n and ¯b = (b_1 + ... + b_n)/n. Then M is given by solving

    min_M ||MX − B||_F^2 , subject to M^T M = I , det(M) = 1.          (1)

The solution is derived from the singular value decomposition of XB^T; see Theorem B.1 in Appendix B.

Assume now that a weighting A of the residual MX − B in (1) is desired; then (1) becomes a weighted orthogonal Procrustes problem (WOPP)

    min_M ||A(MX − B)||_F^2 , subject to M^T M = I , det(M) = 1,       (2)

or, with B := AB, we can write this as

    min_M ||AMX − B||_F^2 , subject to M^T M = I , det(M) = 1.         (3)

We can assume that A and X are 3-by-3 diagonal matrices defined as A = diag(α_1, α_2, α_3) and X = diag(χ_1, χ_2, χ_3), where α_1 ≥ α_2 ≥ α_3 > 0 and χ_1 ≥ χ_2 ≥ χ_3 > 0; see Appendix A. We call this the canonical form of a WOPP. The last strict inequality assumptions, α_3 > 0 and χ_3 > 0, mean that we retain the relevance of M being an orthogonal matrix. The formulation (3) has been studied, e.g., in [1,2].

A solution to a WOPP cannot be derived in a similar manner as for an OPP, where the singular value decomposition is most helpful. Typically, an iterative method is needed. Additionally, (3) can have local minimizers. Hence a solution given by some iterative algorithm is not necessarily a global minimum. In Appendix E, a condition (26) for a global minimum is stated. However, from a practical point of view, having computed a solution ˆQ to (3), it is very hard to verify whether ˆQ is a global minimum by using (26).

An equivalent formulation of (3) is

    min_M (1/2)||f(M) − b||_2^2 , subject to M^T M = I , det(M) = 1,   (4)

where f(M) = F vec(M), F = X^T ⊗ A and b = vec(B). Here ⊗ is the Kronecker product and vec(M) is a stacking of the columns of M = [m_1, m_2, m_3] to form a vector,

    vec(M) = [m_1; m_2; m_3] ∈ R^9.


This paper proposes an iterative algorithm to solve a WOPP based on the formulation given in (4). As a parametrization of $M$ the Cayley transform is used. The algorithm uses the first and second order Taylor expansions of $f(M)$ to get Newton or Gauss-Newton search directions. Given a search direction, the algorithm moves from one point $\tilde{M}$ to another point by following ellipses on the 3-dimensional surface of $f(M)$ in $\mathbb{R}^9$.

2 Choice of coordinate system

One way of representing $M$ as an orthogonal matrix is by using plane rotation matrices, that is, to write $M$ as a matrix product
$$M = P_1(\phi_1) P_2(\phi_2) P_3(\phi_3),$$
where each $P_i(\phi_i) \in \mathbb{R}^{3\times 3}$, $i = 1, 2, 3$, corresponds to a plane rotation around the $x$, $y$ and $z$ axes in $\mathbb{R}^3$.

Another way of representing $M$ is with a skew-symmetric matrix $S \in \mathbb{R}^{3\times 3}$ (a matrix $S$ is skew-symmetric if $S = -S^T$). Every orthogonal matrix $M$ in a neighborhood of a given orthogonal matrix $\tilde{M}$ can be written as
$$M(s) = \tilde{M}C(s). \tag{5}$$
Here $C(s) = (I + S)(I - S)^{-1}$ is the Cayley transform of $S$. With $s = [s_1, s_2, s_3]^T \in \mathbb{R}^3$ we express
$$S(s) = \begin{bmatrix} 0 & -s_1 & -s_2 \\ s_1 & 0 & -s_3 \\ s_2 & s_3 & 0 \end{bmatrix},$$
and referring to $S$ or $s$ is essentially the same. An orthogonal matrix has determinant either $1$ or $-1$, corresponding to a rotation or a reflection. The Cayley transform is defined for orthogonal matrices with positive determinant only. Let $\hat{M}$ be a solution to (4) and consider the sequence $M_0, M_1, \ldots, M_k$ where $M_k \to \hat{M}$ as $k \to \infty$. If a parametrization according to (5) is used, then $\det(M_i) = \det(M_{i+1})$; hence $\det(M_0) = 1$ must be chosen.

At a given point $\tilde{M}$, by using the power series expansion
$$(I - S)^{-1} = I + S + S^2 + S^3 + \ldots$$
we can write $M(s)$ and $f(M(s))$ as
$$M(s) = \tilde{M}(I + 2S + 2S^2 + 2S^3 + \ldots) \tag{6}$$
and
$$f(M(s)) = f(\tilde{M}) + f(\tilde{M}2S) + f(\tilde{M}2S^2) + \ldots = f(\tilde{M}) + Js + \ldots \tag{7}$$
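A minimal MATLAB sketch of the update (5); the function name cayley_step is ours, and the sketch assumes the 3 by 3 case.

function M = cayley_step(Mtilde, s)
% Update M(s) = Mtilde*(I + S)*inv(I - S), equation (5).
S = [  0    -s(1) -s(2);
      s(1)    0   -s(3);
      s(2)  s(3)    0  ];
M = Mtilde * ((eye(3) + S) / (eye(3) - S));   % Cayley transform C(s)
end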


Using the first order approximation of $f(M(s))$ in (4) yields
$$\min_s \|Js + f(\tilde{M}) - b\|_2,$$
and the search direction corresponding to the Gauss-Newton method is
$$s_{GN} = -J^+ (f(\tilde{M}) - b). \tag{8}$$
The Jacobian $J \in \mathbb{R}^{9\times 3}$ can be expressed with column vectors according to
$$J = 2\left[ f(\tilde{M}(e_2 e_1^T - e_1 e_2^T)), \ f(\tilde{M}(e_3 e_1^T - e_1 e_3^T)), \ f(\tilde{M}(e_3 e_2^T - e_2 e_3^T)) \right].$$
The full Newton search direction for (4) is given by
$$s_N = -(J^T J + H)^{-1}(J^T(f(\tilde{M}) - b)), \tag{9}$$
where $H \in \mathbb{R}^{3\times 3}$ is a symmetric matrix containing second order derivatives. To express $H$ we use the term $f(\tilde{M}2S^2)$ and let $\omega = f(\tilde{M}) - b$. Then we get
$$H = \sum_{i=1}^{9} \omega_i \nabla_s^2 f_i(\tilde{M}2S^2).$$
$S^2$ is symmetric, since
$$(S^2)^T = (S^T)^2 = (-S)^2 = (-1)^2 S^2 = S^2,$$
and has the appearance
$$S^2 = \begin{bmatrix} -(s_1^2 + s_2^2) & -s_2 s_3 & s_1 s_3 \\ -s_2 s_3 & -(s_1^2 + s_3^2) & -s_1 s_2 \\ s_1 s_3 & -s_1 s_2 & -(s_2^2 + s_3^2) \end{bmatrix}.$$
To simplify the forthcoming derivations, $S^2$ is written as a sum of 6 matrices $T_{ij}$, with $i = 1, 2, 3$ and $i \le j \le 3$, according to
$$S^2 = s_1^2 \begin{bmatrix} -1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 0 \end{bmatrix} + s_1 s_2 \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & -1 & 0 \end{bmatrix} + s_1 s_3 \begin{bmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & 0 \end{bmatrix} + s_2^2 \begin{bmatrix} -1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & -1 \end{bmatrix} + s_2 s_3 \begin{bmatrix} 0 & -1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} + s_3^2 \begin{bmatrix} 0 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & -1 \end{bmatrix} =$$
$$= s_1^2 T_{11} + s_1 s_2 T_{12} + s_1 s_3 T_{13} + s_2^2 T_{22} + s_2 s_3 T_{23} + s_3^2 T_{33}.$$
With
$$h_{ij} = f(\tilde{M}T_{ij}),$$
$H$ can then be expressed as
$$H = 2\begin{bmatrix} 2\omega^T h_{11} & \omega^T h_{21} & \omega^T h_{31} \\ \omega^T h_{21} & 2\omega^T h_{22} & \omega^T h_{32} \\ \omega^T h_{31} & \omega^T h_{32} & 2\omega^T h_{33} \end{bmatrix}.$$
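The column formula for $J$ translates directly into MATLAB. The sketch below assumes $f(M) = F\,\mathrm{vec}(M)$ as in (4); the function name is ours.

function J = wopp_jacobian(F, Mtilde)
% Columns of J are 2*f(Mtilde*(e_i*e_j' - e_j*e_i')) for the index
% pairs (i,j) = (2,1), (3,1), (3,2), matching s1, s2, s3.
E = eye(3);
pairs = [2 1; 3 1; 3 2];
J = zeros(9, 3);
for k = 1:3
    i = pairs(k, 1); j = pairs(k, 2);
    T = E(:, i) * E(:, j)' - E(:, j) * E(:, i)';
    J(:, k) = 2 * F * reshape(Mtilde * T, 9, 1);   % 2*f(Mtilde*T)
end
end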


36 Paper I3 Moving on <strong>the</strong> surface of f(M)To move from a point f( ˜M) to ano<strong>the</strong>r point can be done by following an ellipseon <strong>the</strong> surface of f(M). This is achieved by writingM = ˜M ˜M T M = ˜M(I + S)(I − S) −1 ,<strong>and</strong> do a decomposition of (I + S)(I − S) −1 according to(I + S)(I − S) −1 = UΦ(φ)U T = C φ (φ), (10)where U ∈ R 3×3 is orthogonal <strong>and</strong>⎡⎤cos(φ) − sin(φ) 0Φ(φ) = ⎣ sin(φ) cos(φ) 0 ⎦. (11)0 0 1The decomposition is described in [5] <strong>and</strong> is based on using <strong>the</strong> eigenvectors of(I + S)(I − S) −1 . The eigenvectors of S <strong>and</strong> (I + S)(I − S) −1 are <strong>the</strong> same.To see this in a simple way, let S = WDW H be a spectral decomposition withD = diag(id 1 , −id 1 , 0), d 1 ∈ R <strong>and</strong> W H W = I. Then(I + S)(I − S) −1 = (I + WDW H )(I − WDW H ) −1 == W(I + D)W H W(I − D) −1 W H = W ˜DW H ,where ˜D is a diagonal matrix containing <strong>the</strong> eigenvalues of (I + S)(I − S) −1 .To get <strong>the</strong> decomposition given in (10), letD := Y DY H , W := WY H ,whereY =⎡⎢⎣1√2− √ i20− √ i2√2 100 0 1⎤⎥⎦ , Y H Y = I ,<strong>the</strong>n D is skew-symmetric asD =⎡⎣ 0 −d 1 0d 1 0 00 0 0<strong>and</strong> let W have column vectors W = [w 1 , w 2 , w 3 ]. Observe that S still fulfillsS = WDW H . The introduction of Y is merely done to get a real valued skewsymmetricrepresentation of D. Consider now a search direction S <strong>and</strong> a scalarα giving <strong>the</strong> step αS, <strong>the</strong>n(I + αS)(I − αS) −1 = W(I + αD)(I − αD) −1 W H . (12)⎤⎦ ,


The Cayley transform of $\alpha D$ is an orthogonal matrix of the form $\Phi(\phi)$ in (11). The real matrix $U$ in (10) is given by $U = [u_1, u_2, u_3] = [\sqrt{2}\,\mathrm{Re}(w_1), \sqrt{2}\,\mathrm{Re}(w_2), w_3]$. For a given direction $S$ at a point $\tilde{M}$, we can then move along the surface of $f(M)$ by using
$$M(\phi) = \tilde{M}U\Phi(\phi)U^T = \tilde{M}C_\phi(\phi). \tag{13}$$
Evidently $M(0) = \tilde{M}$, since $C_\phi(0) = I$.

As indicated in (12), when using (13) the magnitude of $S$ is somewhat unimportant. Also, the negative search direction $-S$ is in this context the same as $S$, corresponding to $\alpha = -1$. This is analogous to moving on an ellipse clockwise or counter-clockwise.

We choose to set $\|S\|_F = \sqrt{2} \Rightarrow \|s\|_2 = 1$, and express $C_\phi(\phi)$ in terms of $S$ as
$$C_\phi(\phi) = \cos\phi\,(I - u_3 u_3^T) + \sin\phi\,S + u_3 u_3^T, \quad u_3 = \begin{bmatrix} s_3 \\ -s_2 \\ s_1 \end{bmatrix}. \tag{14}$$
This means that an explicit decomposition is not needed. Equation (14) is derived by noting that the tangent direction at a point $\tilde{M}$, for a given direction $S$, is
$$\lim_{\Delta\phi \to 0} \frac{f(M(0 + \Delta\phi)) - f(M(0))}{\Delta\phi} = f\!\left(\frac{dM(0)}{d\phi}\right).$$
The derivative at $\phi = 0$ can then be expressed as
$$\frac{dM(0)}{d\phi} = \tilde{M}\frac{dC_\phi(0)}{d\phi} = \tilde{M}U \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} U^T.$$
In (7), note that a tangent direction can be written as $f(\tilde{M}2S)$. Hence $f(\tilde{M}S)$ is also a tangent direction. Then we have that
$$U \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} U^T = \pm S,$$
if $\|S\|_F = \sqrt{2}$. The choice of sign depends on the clockwise or counter-clockwise analogy. Consider the $+S$ case. Using
$$U \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} U^T = I - u_3 u_3^T,$$
we can write $C_\phi(\phi)$ as
$$C_\phi(\phi) = U \begin{bmatrix} \cos(\phi) & -\sin(\phi) & 0 \\ \sin(\phi) & \cos(\phi) & 0 \\ 0 & 0 & 1 \end{bmatrix} U^T = \cos\phi\, U \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} U^T + \sin\phi\, U \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} U^T + U \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} U^T =$$
$$= \cos\phi\,(I - u_3 u_3^T) + \sin\phi\,S + u_3 u_3^T.$$
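A MATLAB sketch of the move (13) using the explicit form (14); the function name is ours, and the direction $s$ is normalized so that $\|s\|_2 = 1$ as above.

function M = move_on_ellipse(Mtilde, s, phi)
% Evaluate M(phi) = Mtilde*C_phi(phi) via (14), with no decomposition.
s = s(:) / norm(s);                        % enforce ||s||_2 = 1
S = [  0    -s(1) -s(2);
      s(1)    0   -s(3);
      s(2)  s(3)    0  ];
u3 = [s(3); -s(2); s(1)];                  % screw axis of C_phi
Cphi = cos(phi)*(eye(3) - u3*u3') + sin(phi)*S + u3*u3';
M = Mtilde * Cphi;
end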


Here $u_3$ is the screw axis of the rotation $C_\phi(\phi)$. The screw axis has the property that if the corresponding rotation is applied to it, it is unchanged: $C_\phi(\phi)u_3 = u_3$, i.e., $u_3$ is the eigenvector of $C_\phi(\phi)$ corresponding to the eigenvalue $\lambda = 1$. Hence the screw axis of the rotation $C_\phi(\phi)$ is the third column vector of $U$. Additionally, the screw axis is orthogonal to the tangent space represented by $S$, i.e.,
$$Su_3 = \begin{bmatrix} 0 & -s_1 & -s_2 \\ s_1 & 0 & -s_3 \\ s_2 & s_3 & 0 \end{bmatrix} u_3 = 0 \ \Rightarrow\ u_3 = \pm\begin{bmatrix} s_3 \\ -s_2 \\ s_1 \end{bmatrix}.$$
As with the search direction $S$, the choice of sign for $u_3$ is not of importance here, since $u_3$ enters (14) only as $u_3 u_3^T$. For more details regarding the screw axis, see [6].

3.1 Computing the step length

For a given search direction $s$ at the point $\tilde{M}$, we can move along the surface of $f(M)$ by using (13). Solving the least squares problem
$$\min_\phi \|f(M(\phi)) - b\|_2^2 \tag{15}$$
yields the angle $\hat{\phi}$ that gives the optimal step for the given search direction $s$.

Since $f(M(\phi))$ describes an ellipse, (15) can be reformulated into the problem of finding the point on an ellipse that is closest to a given point $b_p \in \mathbb{R}^2$. To do so we use (14) to get
$$f(M(\phi)) = f(\tilde{M}C_\phi(\phi)) = f(\tilde{M}(I - u_3 u_3^T))\cos\phi + f(\tilde{M}S)\sin\phi + f(\tilde{M}u_3 u_3^T) = [f(\tilde{M}(I - u_3 u_3^T)),\ f(\tilde{M}S)] \begin{bmatrix} \cos\phi \\ \sin\phi \end{bmatrix} + f(\tilde{M}u_3 u_3^T) = A_f \begin{bmatrix} \cos\phi \\ \sin\phi \end{bmatrix} + c.$$
Now (15) is equivalent to
$$\min_\phi \left\| A_f \begin{bmatrix} \cos\phi \\ \sin\phi \end{bmatrix} - (b - c) \right\|_2^2. \tag{16}$$
Let the singular value decomposition of $A_f$ be
$$A_f = U_A \Sigma_A V_A^T,$$


where $U_A \in \mathbb{R}^{9\times 9}$, $V_A \in \mathbb{R}^{2\times 2}$ are orthogonal and $\Sigma_A = \mathrm{diag}(\sigma_1, \sigma_2) \in \mathbb{R}^{9\times 2}$. Substitute
$$V_A^T \begin{bmatrix} \cos\phi \\ \sin\phi \end{bmatrix} = \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix},$$
and let
$$U_A^T (b - c) = \begin{bmatrix} b_p \\ b_n \end{bmatrix},$$
where $b_p \in \mathbb{R}^2$. Then (16) is equivalent to
$$\min_\theta \left\| \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix} \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} - b_p \right\|_2^2. \tag{17}$$
If $\hat{\theta}$ is a solution to (17), the solution to (16) is given by
$$\begin{bmatrix} \cos\phi \\ \sin\phi \end{bmatrix} = V_A \begin{bmatrix} \cos\hat{\theta} \\ \sin\hat{\theta} \end{bmatrix}.$$

3.1.1 The 2-dimensional ellipse problem

A problem of the form
$$\min_\theta \left\| \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix} \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} - \tilde{b} \right\|_2^2 \tag{18}$$
corresponds to finding the point on the ellipse
$$\Sigma \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} \in \mathbb{R}^2, \quad \Sigma = \mathrm{diag}(\sigma_1, \sigma_2) \in \mathbb{R}^{2\times 2},$$
that lies closest to the point $\tilde{b} \in \mathbb{R}^2$.

A simple and robust way of solving (18) is to solve a fourth degree polynomial equation. At an extreme point the tangent is orthogonal to the residual. The tangent $t$ at a given point $\tilde{\theta}$ is
$$t(\tilde{\theta}) = \Sigma \begin{bmatrix} -\sin\tilde{\theta} \\ \cos\tilde{\theta} \end{bmatrix},$$
so at an extreme point
$$t^T\!\left( \Sigma \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} - \tilde{b} \right) = [-\sin\theta,\ \cos\theta]\,\Sigma\left( \Sigma \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} - \tilde{b} \right) = 0$$
is fulfilled. Substituting $a = \cos\theta$ (so that $\sin\theta = \sqrt{1 - a^2}$) gives the equation
$$-\sigma_1\sqrt{1 - a^2}\,(\sigma_1 a - b_1) + a\sigma_2(\sigma_2\sqrt{1 - a^2} - b_2) = 0 \ \Rightarrow$$
$$(\sigma_1^4 - 2\sigma_2^2\sigma_1^2 + \sigma_2^4)a^4 + (2\sigma_2^2 b_1\sigma_1 - 2b_1\sigma_1^3)a^3 + (2\sigma_2^2\sigma_1^2 - \sigma_1^4 - \sigma_2^4 + b_2^2\sigma_2^2 + \sigma_1^2 b_1^2)a^2 + (-2\sigma_2^2 b_1\sigma_1 + 2b_1\sigma_1^3)a - \sigma_1^2 b_1^2 = 0.$$
Solving this equation for $a$ yields four solutions. The solution with the least residual norm in (18) is the global minimum.
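The quartic approach translates into a few lines of MATLAB; the root filtering and the use of roots() are choices of this sketch, not necessarily those of the paper's solver.

function theta = ellipse_closest(sigma, bt)
% Closest point on the ellipse diag(sigma)*[cos(t); sin(t)] to bt in R^2,
% via the quartic in a = cos(theta) derived above.
s1 = sigma(1); s2 = sigma(2); b1 = bt(1); b2 = bt(2);
c = [ s1^4 - 2*s1^2*s2^2 + s2^4;
      2*s2^2*b1*s1 - 2*b1*s1^3;
      2*s1^2*s2^2 - s1^4 - s2^4 + b2^2*s2^2 + s1^2*b1^2;
     -2*s2^2*b1*s1 + 2*b1*s1^3;
     -s1^2*b1^2 ];
a = roots(c);                              % four roots, possibly complex
a = real(a(abs(imag(a)) < 1e-10));         % keep the real ones
a = max(min(a, 1), -1);                    % clamp to the domain of acos
cand = [acos(a); -acos(a)];                % cover both signs of sin(theta)
res = @(t) norm([s1*cos(t); s2*sin(t)] - [b1; b2]);
[~, k] = min(arrayfun(res, cand));
theta = cand(k);
end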


[Figure 1: Two minima are found where the tangent $T$ is orthogonal to the distance vector from $\tilde{b}$ to the ellipse (the residual). The minimum in the first quadrant is global.]

4 WOPP algorithm for $M \in \mathbb{R}^{3\times 3}$

The algorithm to solve the optimization problem
$$\min_M \frac{1}{2}\|f(M) - b\|_2^2, \quad \text{subject to } M^T M = I, \ \det(M) = 1, \tag{19}$$
consists of several parts. First an initial value $M_0$ is computed. This is done by solving a linear least squares problem with a quadratic equality constraint [3]. Let $\tilde{M}_0$ be the solution to
$$\min_M \|F\,\mathrm{vec}(M) - b\|_2^2, \quad \text{subject to } \|\mathrm{vec}(M)\|_2 = \sqrt{3}. \tag{20}$$
Since $\tilde{M}_0$ is not necessarily orthogonal, an orthogonal matrix $M_0$ approximating $\tilde{M}_0$ is computed by solving an OPP. Let $M_0$ be the solution to
$$\min_{M_0} \|M_0 - \tilde{M}_0\|_F, \quad \text{subject to } M_0^T M_0 = I, \ \det(M_0) = 1. \tag{21}$$
$M_0$ is now used as the initial value for a nonlinear solver that computes a solution to (19).
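A simplified MATLAB sketch of computing $M_0$: the quadratically constrained problem (20), solved exactly in the paper via [3], is replaced here by an unconstrained least squares solution followed by renormalization, and (21) is solved by the standard SVD projection onto rotations. The simplification is an assumption of this sketch, not the paper's procedure.

function M0 = initial_rotation(F, b)
% Crude surrogate for (20): unconstrained LS, then renormalization.
v = F \ b;
v = v * (sqrt(3) / norm(v));
% Projection (21): nearest matrix with M0'*M0 = I and det(M0) = 1.
[U, ~, V] = svd(reshape(v, 3, 3));
D = diag([1, 1, det(U*V')]);               % fix the sign of the determinant
M0 = U * D * V';
end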


The algorithm has the following setup.

WOPP Algorithm: $\min \|f(M) - b\|_2^2$ subject to $M^T M = I$, $\det(M) = 1$.

0. Compute an initial matrix $M_0$ by solving (20) and (21).
1. $k = 0$, $\hat{\Delta} = 10^{-10}$, $\Delta_0 = \hat{\Delta} + 1$.
2. While $\Delta_k > \hat{\Delta}$
   2.1. If $(J^T J + H)$ is positive definite,
        2.1.1. compute a Newton search direction $s = s_N$ according to (9),
   2.2. else
        2.2.1. take a Gauss-Newton search direction $s = s_{GN}$ according to (8).
   2.3. Solve (15) to get the optimal step length, i.e., the optimal $C_\phi(\phi)$.
   2.4. Update $M_{k+1} = M_k C_\phi(\phi)$.
   2.5. $k = k + 1$.
   2.6. $\Delta_k = \dfrac{\|J^T(f(M_k) - b)\|_2}{\|J\|_2\,\|f(M_k) - b\|_2}$.
3. end While.

5 Computational experiments

Here we present some computational results regarding

◦ the number of iterations needed by the WOPP algorithm to compute a solution,
◦ the number of minima of a WOPP (3),
◦ and the reliability of the WOPP algorithm in computing the global minimum.

We consider two ways of adding noise to the problem. First we add noise components relative to the magnitude of the elements in $b$. Second, we consider the case of having isotropic noise.

To compute additional local minima for each generated test problem, a heuristic method was used. The method uses random initial matrices $M_0$, combined with the WOPP algorithm, to compute and store minima. When no new minimum has been found after 100 consecutive random initial matrices, the method is terminated. This works very well, though it is rather time consuming; in most cases all minima were found using fewer than 20 initial matrices (without resetting the counter $k$). However, to give the method a higher reliability, we choose to use too many initial matrices rather than too few. The method has the following setup.


Heuristic Algorithm for computing additional minima

1. $k := 0$
2. while $k < 100$
   2.1. $M_0$ := random orthogonal matrix with $\det(M_0) = 1$.
   2.2. $\hat{M}$ := computed minimum with $M_0$ as an initial matrix for the WOPP algorithm.
   2.3. If $\hat{M}$ is a new minimum (not computed earlier)
        2.3.1. Save $\hat{M}$.
        2.3.2. $k := 0$. (Reset counter)
   2.4. end if.
   2.5. $k := k + 1$.
3. end while.

5.1 Tests with a relative type of perturbation

In this section we present computational results of the algorithm when applied to different types of WOPPs, generated as follows (a sketch of one way to generate matrices with a prescribed condition number is given after this list).

• Set a noise level $\gamma$.
• for $i = 2.5, 5, 10, 50, 200, 500$
  • for $j = 2.5, 5, 10, 50, 200, 500$
    • for $k = 1, 2, \ldots, 200$
      • $A$ := random matrix of dimension $m_A$ by 3, with condition number $\kappa(A) = i$. $m_A$ is a randomly chosen integer in the interval $[3, 20]$.
      • $X$ := random matrix of dimension 3 by $n_X$, with condition number $\kappa(X) = j$. $n_X$ is a randomly chosen integer in the interval $[10, 100]$.
      • Generate a random rotation matrix $\hat{M}$.
      • $\hat{b} := f(\hat{M})$.
      • Generate a perturbation $\delta b$, and take $b := \hat{b} + \gamma\,\delta b$.
      • Compute the canonical form of the WOPP, $F := X^T \otimes A$.
      • Use $F$ and $b$ to compute a solution with the WOPP algorithm, and to compute all minima with the heuristic algorithm.
    • end for $k$; end for $j$; end for $i$.
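One way to produce a random matrix with prescribed condition number (an assumption of this illustration, not necessarily the generator used in the tests) is to reset the singular values of a Gaussian random matrix:

function A = rand_cond(m, n, kappa)
% Random m-by-n matrix (m >= n) with condition number kappa, obtained
% by replacing the singular values of a Gaussian random matrix.
[U, ~, V] = svd(randn(m, n), 'econ');
s = kappa .^ linspace(1, 0, n);            % singular values kappa, ..., 1
A = U * diag(s) * V';
end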


For a given noise level, this results in a total of 7200 tests, with different condition numbers of the matrix $F$,
$$\kappa(F) = \kappa(X)\kappa(A).$$
When generating $\delta b$, each element $\hat{b}_i$ is subject to a perturbation $\gamma\,\delta b_i$, with magnitude relative to the value of $\hat{b}_i$:
$$\delta b_i = \epsilon_i |\hat{b}_i|, \quad i = 1, \ldots, 9.$$
Here $\epsilon_i$ is a scalar chosen randomly from the normal distribution, and the scalar $\gamma > 0$ is used to control the magnitude of the noise level. This type of perturbation is used to avoid adding a too large noise component, which can happen if the values in $\hat{b}$ are of very different magnitudes.

Since the condition number $\kappa(F)$ for some of the problems is quite large, a tolerance of $\hat{\Delta} = 10^{-10}$ was used in the WOPP solver.

Tables 1-5 in Appendix C display results for different choices of the noise level $\gamma$. Each table contains the following information.

κ(F)         Condition number of the matrix $F = X^T \otimes A$.
1m-4m        The number of generated test problems that had one minimum, two minima, three minima and four minima, respectively.
Total tests  The total number of test problems generated with condition number $\kappa(F)$.
GM           The percentage of test problems for which the WOPP algorithm found the global minimum.
Avg.iter     The average number of iterations performed by the WOPP algorithm to compute a solution.
Std.dev      The standard deviation of the number of iterations.

Let us illustrate the results by considering the third row of Table 1. A total of 600 test problems, each with condition number $\kappa(F) = 25$ and noise level $\gamma = 0.05$, were generated. The average number of iterations was 5.15, with a standard deviation of 1.37. Out of these 600 problems, 404 had one minimum, 195 had two minima, one had three minima and none had four minima (computed by the heuristic method). The WOPP algorithm computed the global minimum in 89% of these 600 tests.
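In MATLAB, the relative perturbation above amounts to two lines (bhat and gamma are assumed to hold $\hat{b}$ and $\gamma$):

epsilon = randn(9, 1);                        % one normal deviate per element
b = bhat + gamma * (epsilon .* abs(bhat));    % b = bhat + gamma*delta_b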


5.2 Tests with an isotropic type of perturbation

The previous test problems used a relative noise level. In the following tests we generate isotropic noise; by isotropic we mean that it is equal in all directions. The test problems are generated as follows.

• Set a noise level $\gamma$.
• for $i = 2.5, 5, 10, 50, 200, 500$
  • for $k = 1, 2, \ldots, 200$
    • $A$ := random matrix of dimension $m_A$ by 3, with condition number $\kappa(A) = i$, where $m_A$ is a randomly chosen integer in the interval $[3, 20]$.
    • $\hat{X} := 1000 \cdot \mathrm{randn}(3, n_X)$. $n_X$ is a randomly chosen integer in the interval $[10, 100]$.
    • Generate a random rotation matrix $\hat{M}$.
    • $\hat{B} := \hat{M}\hat{X}$.
    • Generate perturbations $\delta B = 1000 \cdot \mathrm{randn}(3, n_X)$ and $\delta X = 1000 \cdot \mathrm{randn}(3, n_X)$.
    • Add noise to $\hat{B}$: $B := \hat{B} + \gamma\,\delta B$. Add noise to $\hat{X}$: $X := \hat{X} + \gamma\,\delta X$.
    • $B := AB$.
    • Compute the canonical form of the WOPP: $F := X^T \otimes A$, $b := \mathrm{vec}(B)$.
    • Use $F$ and $b$ to compute a solution with the WOPP algorithm, and to compute all minima with the heuristic algorithm.
  • end for $k$; end for $i$.

Here we generate $n_X$ points in $\mathbb{R}^3$ (stored in $\hat{X}$). The points are placed within a "box" by using the MATLAB function randn(·), and the set of points is scaled by 1000 to make the box bigger. As a result, most elements of $X$ lie roughly in the interval $[-2000, 2000]$. The set of points is then rotated to form a new set $\hat{B}$. After that, perturbations $\gamma\,\delta X$ and $\gamma\,\delta B$ are added to $\hat{X}$ and $\hat{B}$, respectively. The size of the perturbation is correlated to the size of the box: say that the box has dimensions $1000 \times 1000 \times 1000$ millimeters; then choosing $\gamma \approx 0.001$ adds noise of around 1 millimeter in magnitude.

The condition number of $X$ in these cases is low, approximately 1. The tests have been sorted and collected depending on the condition number of $A$. Tables 6-10 in Appendix D display results for different values of the noise level $\gamma$.

5.3 Summary of computational results

The computational tests show that the algorithm generally computes a solution in less than 10 iterations for the 42000 test problems considered. Choosing the initial matrix $M_0$ as described has shown to be a good starting approximation for computing a global minimizer, with around an 80% chance of success.


During the tests with isotropic noise, the matrix $X$ had a condition number of around 1. As a result, the generated WOPPs "almost become" OPPs with only one minimum, due to the small differences between the singular values of $X$. As seen, the number of local minima differs considerably from the tests using relative noise (where $A$ and $X$ had higher condition numbers). During the tests, the maximal number of minima found (with positive determinant) was four.

Appendix

A The canonical form of a WOPP

Proposition A.1 The matrices $A \in \mathbb{R}^{m_A\times m}$ and $X \in \mathbb{R}^{n\times n_X}$ with $\mathrm{rank}(A) = m$ and $\mathrm{rank}(X) = n$ belonging to a WOPP
$$\min \frac{1}{2}\|AQX - B\|_F^2, \quad \text{subject to } Q^T Q = I_n,$$
can always be considered as $m$ by $m$ and $n$ by $n$ diagonal matrices, respectively.

Proof. Let $A = U_A\Sigma_A V_A^T$ and $X = U_X\Sigma_X V_X^T$ be the singular value decompositions of $A$ and $X$. Then
$$\|U_A\Sigma_A V_A^T Q U_X\Sigma_X V_X^T - B\|_F^2 = \|U_A\Sigma_A Z\Sigma_X V_X^T - B\|_F^2,$$
where $Z = V_A^T Q U_X \in \mathbb{R}^{m\times n}$ has orthonormal columns. Since $U_A^T U_A = I_{m_A}$ and $V_X^T V_X = I_{n_X}$, it follows that
$$\|U_A\Sigma_A Z\Sigma_X V_X^T - B\|_F^2 = \mathrm{tr}(U_A\Sigma_A Z\Sigma_X V_X^T - B)^T(U_A\Sigma_A Z\Sigma_X V_X^T - B) =$$
$$= \mathrm{tr}(V_X\Sigma_X Z^T\Sigma_A^2 Z\Sigma_X V_X^T - 2V_X\Sigma_X Z^T\Sigma_A U_A^T B + B^T B) =$$
$$= \mathrm{tr}(\Sigma_X Z^T\Sigma_A^2 Z\Sigma_X) - \mathrm{tr}(2\Sigma_X Z^T\Sigma_A U_A^T B V_X) + \mathrm{tr}(B^T B) =$$
$$= \mathrm{tr}(\Sigma_A Z\Sigma_X - U_A^T B V_X)^T(\Sigma_A Z\Sigma_X - U_A^T B V_X) = \|\Sigma_A Z\Sigma_X - U_A^T B V_X\|_F^2.$$
Hence, without loss of generality we can assume that $A = \mathrm{diag}(\alpha_1, \ldots, \alpha_m)$ and $X = \mathrm{diag}(\chi_1, \ldots, \chi_n)$ with $\alpha_i \ge \alpha_{i+1} \ge 0$ and $\chi_i \ge \chi_{i+1} \ge 0$. ✷
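A MATLAB sketch of the reduction in the proof; the function name and the returned factors are choices of this illustration.

function [Ac, Xc, Bc, VA, UX] = wopp_canonical(A, X, B)
% Reduce min ||A*Q*X - B||_F to min ||Ac*Z*Xc - Bc||_F with diagonal
% (rectangular) Ac, Xc. The unknowns are related by Z = VA'*Q*UX,
% so a solution Z maps back via Q = VA*Z*UX'.
[UA, Ac, VA] = svd(A);      % A = UA*Ac*VA'
[UX, Xc, VX] = svd(X);      % X = UX*Xc*VX'
Bc = UA' * B * VX;          % transformed right-hand side
end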


B The solution to an OPP

Theorem B.1 Let $X \in \mathbb{R}^{n\times n}$ and $B \in \mathbb{R}^{m\times n}$ be known matrices with $\mathrm{rank}(X) = n$ and $\mathrm{rank}(B) = n$. Then the solution $\hat{Q}$ of the orthogonal Procrustes problem
$$\min \frac{1}{2}\|QX - B\|_F^2, \quad \text{subject to } Q^T Q = I_n, \tag{22}$$
is $\hat{Q} = VI_{m,n}U^T$, where $V$ and $U$ are the orthogonal matrices given by the singular value decomposition $U\Sigma V^T = XB^T$.

Proof. Since
$$\|QX - B\|_F^2 = \mathrm{trace}((QX - B)^T(QX - B)) = \mathrm{trace}((QX)^T(QX)) + \mathrm{trace}(B^T B) - \mathrm{trace}((QX)^T B) - \mathrm{trace}(B^T(QX)) = \|X\|_F^2 + \|B\|_F^2 - 2\,\mathrm{trace}(B^T QX),$$
equation (22) is equivalent to
$$\max\,\mathrm{trace}(B^T QX), \quad \text{subject to } Q^T Q = I_n. \tag{23}$$
Note that $\mathrm{trace}(B^T QX) = \mathrm{trace}(XB^T Q)$ and let $U\Sigma V^T = XB^T$ be a singular value decomposition. Using the matrix $Z = V^T QU$, $Z \in \mathbb{R}^{m\times n}$, we get
$$\mathrm{trace}(XB^T Q) = \mathrm{trace}(\Sigma V^T QU) = \mathrm{trace}(\Sigma Z) = \sum_{i=1}^n \sigma_i z_{i,i}.$$
Since $Z$ has orthonormal columns, the upper bound of (23) is attained for $z_{i,i} = 1$, i.e., $Z = I_{m,n}$. The solution to (22) is then $V^T QU = I_{m,n} \Rightarrow Q = VI_{m,n}U^T$. ✷

If we consider the balanced case of a WOPP with $X = I_n$,
$$\min \frac{1}{2}\|AQ - B\|_F^2, \quad \text{subject to } Q^T Q = I_n, \tag{24}$$
then (24) is an OPP, since $Q$ is orthogonal [4].
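Theorem B.1 gives a direct MATLAB implementation; the function name is ours.

function Q = opp_solution(X, B)
% Theorem B.1: minimize ||Q*X - B||_F over Q with orthonormal columns,
% where X is n-by-n and B is m-by-n.
[U, ~, V] = svd(X * B');    % X*B' = U*Sigma*V', U is n-by-n, V is m-by-m
n = size(X, 1);
Q = V(:, 1:n) * U';         % Q = V*I_{m,n}*U'
end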


C Tables for the tests with a relative type of perturbation

κ(F)      1m    2m    3m   4m   GM    Avg.iter   Std.dev   Total tests
6.25      165   35    0    0    94%   5.2        1.81      200
12.5      316   84    0    0    95%   5.06       1.32      400
25        404   195   1    0    89%   5.15       1.37      600
50        167   226   7    0    88%   5.34       1.49      400
100       44    151   4    1    83%   5.44       1.4       200
125       309   91    0    0    99%   5.42       1.18      400
250       140   254   6    0    89%   5.62       1.3       400
500       376   401   13   10   89%   5.85       1.47      800
1000      134   260   6    0    89%   5.99       1.19      400
1250      323   77    0    0    98%   5.83       1.6       400
2000      60    321   11   8    86%   6.16       1.99      400
2500      149   427   13   11   87%   6.11       1.16      600
5000      57    321   12   10   83%   6.39       1.41      400
10000     5     359   5    31   73%   6.37       1.63      400
25000     7     341   13   39   82%   6.42       1.29      400
40000     3     175   3    19   88%   6.4        2.15      200
100000    0     349   0    51   81%   6.47       1.23      400
250000    1     174   2    23   83%   6.83       1.5       200

Table 1: Results when using γ = 0.05, corresponding to a noise level around 5%.

κ(F)      1m    2m    3m   4m   GM    Avg.iter   Std.dev   Total tests
6.25      171   29    0    0    95%   4.99       1.37      200
12.5      311   89    0    0    92%   5.25       1.55      400
25        394   204   2    0    94%   5.23       1.52      600
50        175   219   6    0    89%   5.51       1.44      400
100       52    143   4    1    85%   5.63       1.53      200
125       299   101   0    0    98%   5.67       2.89      400
250       145   250   5    0    91%   5.76       1.26      400
500       356   426   12   6    89%   5.8        1.06      800
1000      135   260   4    1    90%   6.03       1.22      400
1250      308   92    0    0    98%   5.74       0.87      400
2000      51    324   12   13   82%   6.17       1.12      400
2500      148   409   19   24   89%   6.06       1.02      600
5000      47    337   7    9    83%   6.32       1.36      400
10000     9     350   9    32   79%   6.42       1.29      400
25000     5     342   8    45   74%   6.53       1.46      400
40000     0     175   2    23   83%   6.53       1.4       200
100000    7     357   3    33   86%   6.67       2.74      400
250000    2     177   2    19   88%   6.49       1.21      200

Table 2: Results when using γ = 0.1, corresponding to a noise level around 10%.


κ(F)      1m    2m    3m   4m   GM    Avg.iter   Std.dev   Total tests
6.25      168   32    0    0    94%   4.95       1.63      200
12.5      313   86    1    0    94%   5.14       1.3       400
25        400   200   0    0    92%   5.25       1.27      600
50        162   234   4    0    88%   5.53       1.45      400
100       45    149   5    1    86%   5.65       1.26      200
125       319   81    0    0    97%   5.63       1.13      400
250       153   242   3    2    91%   5.85       1.1       400
500       368   415   14   3    90%   5.85       1.13      800
1000      144   250   4    2    91%   6.09       1.17      400
1250      312   88    0    0    98%   5.74       0.91      400
2000      50    332   9    9    82%   6.29       1.21      400
2500      133   430   12   25   88%   6.13       1.15      600
5000      49    332   6    13   84%   6.31       1.44      400
10000     7     355   11   27   79%   6.34       1.14      400
25000     11    350   10   29   82%   6.34       1.27      400
40000     4     171   5    20   86%   6.49       1.13      200
100000    3     361   4    32   84%   6.49       1.43      400
250000    3     175   1    21   89%   6.58       2.26      200

Table 3: Results for a noise level around 15%, γ = 0.15.

κ(F)      1m    2m    3m   4m   GM    Avg.iter   Std.dev   Total tests
6.25      168   32    0    0    96%   5.07       1.47      200
12.5      326   74    0    0    96%   5.06       1.29      400
25        388   212   0    0    93%   5.29       1.3       600
50        162   236   2    0    87%   5.69       1.38      400
100       38    152   9    1    84%   5.55       1.19      200
125       307   93    0    0    98%   5.55       0.92      400
250       132   265   2    1    90%   5.96       1.24      400
500       367   412   12   9    90%   5.86       1.02      800
1000      146   248   5    1    90%   6.07       1.13      400
1250      304   96    0    0    99%   5.71       0.93      400
2000      49    333   12   6    86%   6.16       1.15      400
2500      135   438   15   12   90%   6.08       1.1       600
5000      50    329   12   9    82%   6.27       1.2       400
10000     9     345   13   33   81%   6.29       1.13      400
25000     6     351   12   31   80%   6.31       1.53      400
40000     5     175   1    19   90%   6.45       1.32      200
100000    4     351   2    43   88%   6.4        1.26      400
250000    2     176   1    21   86%   6.81       2.71      200

Table 4: Results for a noise level around 20%, γ = 0.2.


κ(F)      1m    2m    3m   4m   GM    Avg.iter   Std.dev   Total tests
6.25      169   31    0    0    94%   5.12       1.55      200
12.5      318   81    1    0    94%   5.24       1.29      400
25        399   200   1    0    95%   5.35       1.06      600
50        142   253   5    0    85%   5.73       1.27      400
100       49    143   7    1    86%   5.88       1.14      200
125       304   96    0    0    97%   5.62       0.87      400
250       129   267   3    1    90%   5.93       1.09      400
500       357   427   11   5    91%   5.9        1.04      800
1000      141   251   7    1    87%   6.05       1.19      400
1250      298   102   0    0    98%   5.71       1.06      400
2000      50    328   9    13   82%   6.18       1.2       400
2500      148   427   12   13   91%   6.01       1.17      600
5000      52    331   9    8    85%   6.22       2.26      400
10000     11    349   8    32   83%   6.38       1.21      400
25000     21    339   6    34   81%   6.46       2.45      400
40000     7     171   6    16   86%   6.5        2.07      200
100000    9     348   3    40   87%   6.51       1.98      400
250000    10    167   1    22   86%   7.1        5.89      200

Table 5: Results for a noise level around 50%, γ = 0.5.


D Tables for the tests with an isotropic type of perturbation

κ(A)   1m    2m   3m   4m   GM     Avg.iter   Std.dev   Total tests
2.5    388   12   0    0    98%    4.3        2.06      400
5      399   1    0    0    100%   4.47       1.51      400
10     397   3    0    0    100%   4.33       1.54      400
50     396   4    0    0    99%    4.42       1.61      400
200    398   2    0    0    100%   4.45       1.64      400
500    398   2    0    0    100%   4.43       1.66      400
1000   394   6    0    0    100%   4.48       1.63      400
5000   392   8    0    0    99%    4.58       1.76      400

Table 6: Results with γ = 0.001.

κ(A)   1m    2m   3m   4m   GM     Avg.iter   Std.dev   Total tests
2.5    394   6    0    0    99%    4.34       1.65      400
5      395   5    0    0    99%    4.32       1.5       400
10     393   7    0    0    99%    4.29       1.46      400
50     394   6    0    0    99%    4.55       1.54      400
200    398   2    0    0    100%   4.6        1.61      400
500    397   3    0    0    100%   4.76       1.58      400
1000   398   2    0    0    100%   4.7        1.62      400
5000   393   7    0    0    99%    4.7        1.64      400

Table 7: Results with γ = 0.01.

κ(A)   1m    2m   3m   4m   GM     Avg.iter   Std.dev   Total tests
2.5    397   3    0    0    99%    4.72       1.93      400
5      397   3    0    0    100%   4.75       1.28      400
10     393   7    0    0    99%    4.98       1.4       400
50     398   2    0    0    100%   4.92       1.29      400
200    397   3    0    0    100%   5.05       1.55      400
500    397   3    0    0    99%    5.13       1.44      400
1000   395   5    0    0    99%    5.02       1.41      400
5000   389   11   0    0    99%    5.14       1.45      400

Table 8: Results with γ = 0.1.


κ(A)   1m    2m   3m   4m   GM     Avg.iter   Std.dev   Total tests
2.5    397   3    0    0    99%    4.96       1.52      400
5      398   2    0    0    100%   4.98       1.34      400
10     395   5    0    0    100%   5.06       1.36      400
50     390   10   0    0    99%    5.04       1.36      400
200    395   5    0    0    99%    5.29       1.46      400
500    395   5    0    0    99%    5.28       1.47      400
1000   394   6    0    0    100%   5.1        1.39      400
5000   395   5    0    0    100%   5.12       1.46      400

Table 9: Results with γ = 0.2.

κ(A)   1m    2m   3m   4m   GM     Avg.iter   Std.dev   Total tests
2.5    391   9    0    0    99%    5.42       2.05      400
5      386   14   0    0    99%    5.36       1.61      400
10     391   9    0    0    99%    5.47       1.49      400
50     389   9    2    0    99%    5.45       1.55      400
200    388   12   0    0    98%    5.68       1.43      400
500    391   9    0    0    99%    5.6        1.58      400
1000   390   10   0    0    99%    5.51       1.41      400
5000   385   15   0    0    98%    5.72       1.45      400

Table 10: Results with γ = 0.5.

E Finding a condition for a global minimizer

Consider the two dimensional ellipse
$$e(\phi) = \begin{bmatrix} \sigma_1\cos(\phi) \\ \sigma_2\sin(\phi) \end{bmatrix},$$
and let $b = [\tilde{x}, \tilde{y}]^T$ be a point in the first quadrant, i.e., $\tilde{x} > 0$ and $\tilde{y} > 0$. The global minimum $\hat{\phi} = \arg\{\min_\phi \|e(\phi) - b\|_2\}$ lies in the first quadrant. By moving $b$ along the elongation of the normal to the tangent $e'(\hat{\phi})$, $\hat{\phi}$ will remain the global minimum until $b$ hits the $x$-axis at $b_0 = [x, 0]^T$. If $b = b_0$, the problem has two local minima.

Let $\hat{e} = e(\hat{\phi})$. Then $\hat{\phi}$ is a global minimizer only if
$$\hat{e}^T(b - \hat{e}) > \hat{e}^T(b_0 - \hat{e}), \tag{25}$$
i.e.,
$$\|\hat{e}\|\cdot\|b - \hat{e}\|\cos(\alpha) > \|\hat{e}\|\cdot\|b_0 - \hat{e}\|\cos(\beta),$$
according to Figure 2. If $b$ is inside the ellipse, then the intermediate angles $\alpha$ and $\beta$ are equal with $\alpha = \beta > \pi/2$, hence $\cos(\alpha) = \cos(\beta) < 0$; (25) is then fulfilled since $\|b - \hat{e}\| < \|b_0 - \hat{e}\|$. If $b$ lies outside the ellipse, then $\alpha = \beta + \pi \Rightarrow \cos(\alpha) > 0$, so (25) holds, since $\cos(\beta) < 0$.


[Figure 2: The global minimum exists in the first quadrant as long as $b$ does not cross the $x$-axis. The angle between $e(\phi)$ and $b - e(\phi)$ is $\beta$ if $b$ is inside the ellipse; otherwise it is $\alpha$.]

Theorem 1 The residual $b_0 - \hat{e} = [x, 0]^T - \hat{e}$ that is orthogonal to the tangent $e'(\phi)$ of the ellipse $e(\phi)$ at $\phi = \hat{\phi}$ fulfills the condition
$$\hat{e}^T(b_0 - \hat{e}) = -\sigma_2^2.$$

Proof. Express $b_0$ by using the fact that $b_0 - \hat{e}$ is orthogonal to $e'(\hat{\phi})$:
$$[b_0 - \hat{e}]^T e'(\hat{\phi}) = -\sigma_1 x\sin(\hat{\phi}) + \cos(\hat{\phi})\sin(\hat{\phi})(\sigma_1^2 - \sigma_2^2) = 0.$$
Solving for $x$ yields
$$b_0 = \begin{bmatrix} \dfrac{\cos(\hat{\phi})(\sigma_1^2 - \sigma_2^2)}{\sigma_1} \\ 0 \end{bmatrix}.$$
Then
$$\hat{e}^T(b_0 - \hat{e}) = \sigma_1\cos(\hat{\phi})\left(\frac{\cos(\hat{\phi})(\sigma_1^2 - \sigma_2^2)}{\sigma_1} - \sigma_1\cos(\hat{\phi})\right) - \sigma_2^2\sin^2(\hat{\phi}) = \cos^2(\hat{\phi})(\sigma_1^2 - \sigma_2^2) - \sigma_1^2\cos^2(\hat{\phi}) - \sigma_2^2\sin^2(\hat{\phi}) = -\sigma_2^2. \ ✷$$


To generalize this to a condition for a global minimizer of a WOPP
$$\min_M \frac{1}{2}\|f(M) - b\|_2^2, \quad \text{subject to } M^T M = I,$$
recall that for a given search direction $s$ at a point $\hat{M}$, $f(M(\phi))$ describes an ellipse. Hence we can conclude that for a global minimizer, (25) must hold for all search directions $s \in \mathbb{R}^3$, i.e., for all $s$ with $\|s\|_2 = 1$. That is, if $\hat{M}$ is a global minimizer, the following inequality must hold:
$$a_1(s)^T(b - f(\hat{M})) \ge -\sigma_2^2(s) \quad \forall\, \|s\|_2 = 1. \tag{26}$$
Here $\sigma_2^2(s)$ is the squared magnitude of the semi-minor axis of the ellipse $f(M(\phi))$ and $a_1(s) = f(\hat{M}(I - u_3 u_3^T))$. $\sigma_2(s)$ is the smallest singular value of $A_f$, where
$$A_f = [a_1, a_2] = [f(\hat{M}(I - u_3 u_3^T)),\ f(\hat{M}S)]$$
comes from the ellipse
$$f(M(\phi)) = A_f \begin{bmatrix} \cos(\phi) \\ \sin(\phi) \end{bmatrix}.$$
However, verifying this in practice seems more complicated than solving the problem itself. Let $\lambda(A_f^T A_f)$ denote the eigenvalues of $A_f^T A_f$. Then $\sigma_2$ can be expressed as
$$\sigma_2^2 = \min(\lambda(A_f^T A_f)) = \frac{a_1^T a_1 + a_2^T a_2}{2} - \frac{\sqrt{(a_1^T a_1)^2 + (a_2^T a_2)^2 + 4(a_1^T a_2)^2 - 2\,a_1^T a_1\,a_2^T a_2}}{2}.$$
Using this expression to analyze (26) seems to be more complicated than the WOPP itself.
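Condition (26) can at least be spot-checked numerically by sampling directions $s$; the following Monte Carlo sketch (entirely an illustration, not a procedure from the paper) assumes the canonical 3 by 3 setting with $f(M) = F\,\mathrm{vec}(M)$.

function ok = check_condition_26(F, Mhat, b, nsamples)
% Sample unit directions s and test a1(s)'*(b - f(Mhat)) >= -sigma2(s)^2.
% Returns false if a sampled direction violates the inequality.
ok = true;
r = b - F * Mhat(:);                       % b - f(Mhat)
for t = 1:nsamples
    s = randn(3, 1); s = s / norm(s);
    S = [0 -s(1) -s(2); s(1) 0 -s(3); s(2) s(3) 0];
    u3 = [s(3); -s(2); s(1)];
    a1 = F * reshape(Mhat * (eye(3) - u3*u3'), 9, 1);
    a2 = F * reshape(Mhat * S, 9, 1);
    sigma2sq = min(eig([a1 a2]' * [a1 a2]));   % sigma_2(s)^2 of A_f
    if a1' * r < -sigma2sq
        ok = false; return;
    end
end
end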


References

[1] P. G. Batchelor and J. M. Fitzpatrick. A Study of the Anisotropically Weighted Procrustes Problem. IEEE Workshop on Mathematical Methods in Biomedical Image Analysis (MMBIA'00), page 212, 2000.
[2] M. T. Chu and N. T. Trendafilov. On a Differential Equation Approach to the Weighted Orthogonal Procrustes Problem. Statistics and Computing, 8(2):125-133, 1998.
[3] W. Gander. Least Squares with a Quadratic Constraint. Numer. Math., 36:291-307, 1981.
[4] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1989.
[5] P. R. Halmos. Finite-dimensional Vector Spaces. Van Nostrand, 1958.
[6] I. Söderkvist. Some Numerical Methods for Kinematical Analysis. ISSN-0348-0542, UMINF-186.90, Department of Computing Science, Umeå University, 1990.
[7] I. Söderkvist and P.-Å. Wedin. On Condition Numbers and Algorithms for Determining a Rigid Body Movement. BIT, 34:424-436, 1994.


Paper II

Algorithms for Linear Least Squares Problems on the Stiefel Manifold*

Thomas Viklands† and Per-Åke Wedin
Department of Computing Science, Umeå University
SE-901 87 Umeå, Sweden.
viklands@cs.umu.se, pwedin@cs.umu.se

Abstract

In this paper we consider optimization problems of the form $\min_Q \|f(Q) - b\|_2^2$ subject to $Q^T Q = I_n$, where $Q \in \mathbb{R}^{m\times n}$ with $n \le m$. $f(Q) \in \mathbb{R}^k$ is a function linear in $Q$, and $b \in \mathbb{R}^k$ is a known vector. Problems of this kind are, for instance, different types of Procrustes problems. The matrix exponential is used to make local parametrizations of $Q$. The algorithm is based on Newton and Gauss-Newton search directions and uses an optimal step length in each step. The algorithm has been implemented in a software package for MATLAB.

Keywords: weighted, orthogonal, Procrustes, global minimum, Stiefel manifold, Cayley transform, skew symmetric.

* From UMINF-06.07, 2006.
† Financial support has partly been provided by the Swedish Foundation for Strategic Research under the frame program grant A3 02:128.


Contents

1 Introduction
2 Parametrization of $V_{m,n}$
3 Series expansion
   3.1 Computing the optimal step length
4 The overall algorithm
5 Computational experiments
   5.1 Relative perturbations
   5.2 Null space perturbations
       5.2.1 Tests with non-square F
6 Summary of computational experiments
A The canonical form of a WOPP
B Parametrization of $V_{m,n}$ by using the Cayley transform
   B.1 Search directions with Cayley representation
C Results from computational experiments
   C.1 Tables for the relative type of perturbations
   C.2 Tables for null space perturbations
       C.2.1 Using a square matrix $F \in \mathbb{R}^{mn\times mn}$
       C.2.2 Using a non-square matrix $F \in \mathbb{R}^{(mn-n)\times mn}$
References


1 Introduction

Consider the minimization of a quadratic function of the form $\|f(Q) - b\|_2^2$, where $\|\cdot\|_2$ is the Euclidean 2-norm, on the Stiefel manifold [12],
$$V_{m,n} = \{Q \in \mathbb{R}^{m\times n} : Q^T Q = I_n, \ n \le m\}.$$
Here $f(Q) \in \mathbb{R}^k$ is a function linear in $Q$ and $b \in \mathbb{R}^k$ is a known vector. Embedding Stiefel manifolds in an $mn$-dimensional Euclidean space can be done with the vec-operator: $\mathrm{vec}(Q)$ is a stacking of the columns of $Q = [q_1, \ldots, q_n]$ into a vector
$$\mathrm{vec}(Q) = \begin{bmatrix} q_1 \\ q_2 \\ \vdots \\ q_n \end{bmatrix}.$$
In connection with the vec-operator, the Kronecker product $\otimes$ often appears when embedding matrix functions, e.g.,

• $AQX \Rightarrow f(Q) = (X^T \otimes A)\,\mathrm{vec}(Q)$,
• $AQX + CQ^T D \Rightarrow f(Q) = (X^T \otimes A + (D^T \otimes C)P)\,\mathrm{vec}(Q)$, where $P$ is a suitable permutation matrix such that $P\,\mathrm{vec}(Q) = \mathrm{vec}(Q^T)$.

We consider optimization problems of the form
$$\min_Q \frac{1}{2}\|f(Q) - b\|_2^2, \quad \text{subject to } Q \in V_{m,n}, \tag{1}$$
where $f(Q) \in \mathbb{R}^k$ can be written as $f(Q) = F\,\mathrm{vec}(Q)$ with $F \in \mathbb{R}^{k\times mn}$. We also assume that $F$ corresponds to a reasonable problem formulation. A (perhaps extreme) example of an unreasonable problem is the following. Let $Q \in \mathbb{R}^{3\times 2}$ and $F = [1, 0, \ldots, 0] \in \mathbb{R}^{1\times 6}$. Evidently, for any given $b \in \mathbb{R}^1$, (1) has an infinite number of solutions. A better way of solving this problem would be to reformulate it as, e.g.,
$$\min_c |1\cdot c - b|, \quad \text{subject to } |c| \le 1,$$
where $c$ is a scalar. However, this optimization problem is not of the form stated in (1). Practically, we could say that $F$ and $b$ should correspond to a sufficient number of observations so that (1) makes sense. Assume $Q(s)$ with $s \in \mathbb{R}^p$ is a parametrization of $V_{m,n}$. It is desired that the Jacobian $J \in \mathbb{R}^{k\times p}$ of $f(Q)$ have full column rank, which cannot occur if $k < p$. Also, a very sparse $F$ can result in $\mathrm{rank}(J) < p$, even though $k \ge p$ and $\mathrm{rank}(F) \ge p$. The algorithm presented in this contribution can to some extent be used to solve (1) in these unreasonable cases, but it is not tailor-made to deal with them.

Some optimization problems that can be formulated as (1) are different classes of Procrustes problems, for instance the weighted orthogonal Procrustes problem (WOPP)
$$\min \|AQX - B\|_F^2, \quad \text{subject to } Q \in V_{m,n}, \tag{2}$$


where $A$, $X$ and $B$ are known matrices. For a WOPP one can assume that $A = \mathrm{diag}(\alpha_1, \ldots, \alpha_m)$ and $X = \mathrm{diag}(\chi_1, \ldots, \chi_n)$ are $m \times m$ and $n \times n$ diagonal matrices respectively, where $\alpha_i \ge \alpha_{i+1} \ge 0$ and $\chi_i \ge \chi_{i+1} \ge 0$, see Appendix A. $f(Q)$ in (1) is then $f(Q) = F\,\mathrm{vec}(Q)$ with $F = (X^T \otimes A) = \mathrm{diag}(\chi_1 A, \ldots, \chi_n A)$. Hence $F \in \mathbb{R}^{mn\times mn}$ is also a diagonal matrix. It is indeed wise to do this reformulation, since it results in a smaller computational cost, as opposed to using the original $A$ and $X$ with possibly greater dimensions and dense structure. Earlier work in connection with iterative algorithms for solving problems similar to (2) is reported in [1-5, 7-11, 13].

This paper proposes a Newton and Gauss-Newton based method to solve optimization problems of the form (1). As a parametrization of $Q \in V_{m,n}$, the matrix exponential of a skew-symmetric matrix is used. Given a search direction (Newton or Gauss-Newton), the optimal step length is computed in each iteration. This is done by computing the roots of a polynomial.

2 Parametrization of $V_{m,n}$

Two common ways to represent an orthogonal matrix $Q \in \mathbb{R}^{m\times m}$ are to use a matrix exponential function or a Cayley transform. Our first approach was to use the Cayley transform, see Appendix B. Later on it was found that by using the matrix exponential instead, optimal step lengths can be computed quite simply, see Section 3.1. Both representations make use of a skew-symmetric matrix
$$S(s) = \begin{bmatrix} 0 & -s_1 & -s_2 & \cdots & -s_{m-1} \\ s_1 & 0 & -s_m & -s_{m+1} & \cdots \\ s_2 & s_m & 0 & & \\ \vdots & & & \ddots & \end{bmatrix}, \quad S = -S^T \in \mathbb{R}^{m\times m},$$
where $s = [s_1, s_2, \ldots, s_p]^T \in \mathbb{R}^p$. The number of parameters $s_1, s_2, \ldots, s_p$ is $p = (m^2 - m)/2$ when $Q \in \mathbb{R}^{m\times m}$. Given a point $\tilde{Q}$, we can represent any $Q$ in the vicinity of $\tilde{Q}$ as
$$Q(s) = \tilde{Q}\exp(S(s)).$$
A parametrization of $V_{m,n}$ when $n < m$ can be done as in [3]. Let $\tilde{Q} \in \mathbb{R}^{m\times n}$ be a given point, and let
$$Q(s) = [\tilde{Q}, \tilde{Q}_\perp]\exp(S(s))I_{m,n} \tag{3}$$
be a local parametrization around $\tilde{Q}$. Here $\tilde{Q}_\perp$ is a column expansion that makes $[\tilde{Q}, \tilde{Q}_\perp] \in \mathbb{R}^{m\times m}$ orthogonal, and
$$I_{m,n} = \begin{bmatrix} I_n \\ 0 \end{bmatrix} \in \mathbb{R}^{m\times n}.$$
The number of parameters is now $p = mn - (n^2 + n)/2$, and $S \in \mathbb{R}^{m\times m}$ has the form
$$S = \begin{bmatrix} S_{1,1} & -S_{2,1}^T \\ S_{2,1} & 0 \end{bmatrix}, \tag{4}$$
where $S_{1,1} \in \mathbb{R}^{n\times n}$ is skew-symmetric, $S_{2,1} \in \mathbb{R}^{(m-n)\times n}$, and the lower right part is an $(m-n)\times(m-n)$ zero matrix.
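A MATLAB sketch of evaluating (3); the use of null() to obtain a column expansion $\tilde{Q}_\perp$ is an implementation choice of this sketch.

function Q = stiefel_point(Qtilde, S)
% Evaluate the local parametrization (3) at a skew-symmetric S of the
% form (4).
[m, n] = size(Qtilde);
Qperp = null(Qtilde');                     % orthonormal complement basis
Q = [Qtilde, Qperp] * expm(S) * [eye(n); zeros(m - n, n)];
end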


The Cayley transform can also be used to make a parametrization of $V_{m,n}$, see Appendix B.

3 Series expansion

In order to compute Newton or Gauss-Newton search directions for (1), the Jacobian $J \in \mathbb{R}^{k\times p}$ of $f(Q)$ and the second order derivative matrix $H \in \mathbb{R}^{p\times p}$ of (1) are needed. To derive these we use the series expansion of $\exp(S)$,
$$\exp(S) = I + S + \frac{S^2}{2} + \frac{S^3}{3!} + \frac{S^4}{4!} + \ldots$$
The expansion of $Q(s)$ and $f(Q(s))$ around $\tilde{Q}$ is then
$$Q(s) = [\tilde{Q}, \tilde{Q}_\perp]\left(I + S + \frac{S^2}{2!} + \frac{S^3}{3!} + \frac{S^4}{4!} + \ldots\right)I_{m,n}$$
and
$$f(Q(s)) = f(\tilde{Q}) + f([\tilde{Q}, \tilde{Q}_\perp]SI_{m,n}) + f\!\left([\tilde{Q}, \tilde{Q}_\perp]\frac{S^2}{2!}I_{m,n}\right) + \ldots = f(\tilde{Q}) + Js + \ldots \tag{5}$$
The Jacobian can be expressed with column vectors as
$$J = [j_{2,1}, \ldots, j_{m,1}, j_{3,2}, \ldots, j_{m,2}, \ldots, j_{m,n}],$$
where
$$j_{i,j} = f([\tilde{Q}, \tilde{Q}_\perp](e_i e_j^T - e_j e_i^T)I_{m,n}).$$
By using the first order approximation in (5), a Gauss-Newton search direction for (1) is given by solving
$$\min_s \|f(\tilde{Q}) + Js - b\|_2^2,$$
i.e.,
$$s_{GN} = -J^+(f(\tilde{Q}) - b). \tag{6}$$
The Newton search direction for (1) is given by
$$s_N = -(J^T J + H)^{-1}(J^T(f(\tilde{Q}) - b)), \tag{7}$$
where $H \in \mathbb{R}^{p\times p}$ is a symmetric matrix containing second order derivatives. An element $h_{i,j} \in H$ with $i \ne j$ can be written as
$$h_{i,j} = (f(\tilde{Q}) - b)^T f([\tilde{Q}, \tilde{Q}_\perp]T_{ij}I_{m,n}).$$


$T_{ij} \in \mathbb{R}^{m\times m}$ is a sparse matrix (here $T_{ij}$ does not denote the element at position $(i,j)$ of a matrix $T$) with either two nonzero elements or, in some cases, all elements equal to zero:
$$T_{ij} = \frac{1}{2}(S(\tilde{s}))^2 - D_{\tilde{s}},$$
where the elements $i$ and $j$ of $\tilde{s}$ equal 1 and the rest are zero. $D_{\tilde{s}}$ is a diagonal matrix that eliminates any diagonal elements of $\frac{1}{2}(S(\tilde{s}))^2$. When $i = j$ we get
$$h_{i,i} = 2(f(\tilde{Q}) - b)^T f([\tilde{Q}, \tilde{Q}_\perp]T_{ii}I_{m,n}),$$
and now
$$T_{ii} = \frac{1}{2}(S(\tilde{s}))^2,$$
where element $i$ of $\tilde{s}$ equals 1 and the remaining elements are zero. $T_{ii}$ is a diagonal matrix and contains only two nonzero elements.

3.1 Computing the optimal step length

For a given descent direction $s$ at a point $\tilde{Q}$, obtained by solving (6) or (7), moving along the surface of $f(Q)$ in the direction $s$ is done by using
$$Q(\beta s) = [\tilde{Q}, \tilde{Q}_\perp]\exp(\beta S(s))I_{m,n},$$
where $\beta > 0$ is a scalar. The optimal step length is found by solving
$$\min_\beta \frac{1}{2}\|f(Q(\beta s)) - b\|_2^2. \tag{8}$$
To do this we use a finite series expansion of $\exp(\beta S)$ of order $t$,
$$\exp(\beta S) \approx I + S\beta + \frac{S^2}{2}\beta^2 + \frac{S^3}{3!}\beta^3 + \frac{S^4}{4!}\beta^4 + \ldots + \frac{S^t}{t!}\beta^t.$$
The objective function in (8) is then approximated according to
$$\|f(Q(\beta s)) - b\|_2^2 \approx (f_0 + f_1\beta + \ldots + f_t\beta^t)^T(f_0 + f_1\beta + \ldots + f_t\beta^t) = P_{2t}(\beta),$$
where $f_0 = f(\tilde{Q}) - b$ and
$$f_i = f\!\left([\tilde{Q}, \tilde{Q}_\perp]\frac{S(s)^i}{i!}I_{m,n}\right), \quad i = 1, 2, \ldots, t.$$
Hence $P_{2t}(\beta)$ is a polynomial in $\beta$ of degree $2t$. By choosing $t$ sufficiently large, the critical points of (8) are given by computing the positive real solutions of a polynomial equation of degree $2t - 1$, i.e.,
$$\frac{d}{d\beta}P_{2t}(\beta) = P_{2t-1}(\beta) = 0.$$


Let $\beta_i$, $i = 1, 2, \ldots$ be the positive real solutions of $P_{2t-1}(\beta) = 0$, ordered such that $\beta_i < \beta_{i+1}$. Since $s$ is a descent direction, $\beta_1$ is always a minimum or saddle point of (8). Consider the interval $(0, \mu]$ where $\mu > 0$; the optimal step length on the interval is given by
$$\hat{\beta} = \arg\{\min(P_{2t}(\beta_i), P_{2t}(\mu)) \ \forall\, \beta_i < \mu\}.$$
Typically, when considering local convergence, $\mu = 1$ is used, corresponding to a full step length.

4 The overall algorithm

To get a starting approximation $Q_0$ for the nonlinear solver, first a least squares problem with a quadratic equality constraint is solved,
$$\tilde{Q}_0 = \arg\{\min_Q \|F\,\mathrm{vec}(Q) - b\|_2^2, \quad \text{subject to } \|\mathrm{vec}(Q)\|_2 = \sqrt{n}\}. \tag{9}$$
Since $\tilde{Q}_0 \in \mathbb{R}^{m\times n}$ does not necessarily have orthonormal columns, the OPP
$$Q_0 = \arg\{\min_Q \|Q - \tilde{Q}_0\|_F^2, \quad \text{subject to } Q \in V_{m,n}\} \tag{10}$$
is solved, yielding $Q_0 \in V_{m,n}$, which is used as the initial value for the iterative algorithm. The algorithm to compute a minimum of (1) works as follows.

Algorithm: $\min \|f(Q) - b\|_2^2$ subject to $Q \in V_{m,n}$

0. Compute $Q_0$ by solving (9) and (10).
1. $j = 0$, $\mu = 1$, $\hat{\Delta} = 10^{-10}$, $\Delta_0 = \hat{\Delta} + 1$.
2. While $\Delta_j > \hat{\Delta}$
   2.1. If $(J^T J + H)$ is positive definite,
        2.1.1. compute a Newton search direction $s = s_N$ (7),
   2.2. else
        2.2.1. take a Gauss-Newton search direction $s = s_{GN}$ (6).
   2.3. Compute the optimal step length $\hat{\beta}$ on the interval $(0, \mu]$ by solving (8).
   2.4. Update $Q_{j+1} = [Q_j, (Q_j)_\perp]\exp(S(\hat{\beta}s))I_{m,n}$.
   2.5. $j = j + 1$.
   2.6. $\Delta_j = \dfrac{\|J^T(f(Q_j) - b)\|_2}{\|J\|_2\,\|f(Q_j) - b\|_2}$.
3. end While.

The algorithm has been implemented in MATLAB, and can be downloaded from
http://www.cs.umu.se/~viklands/WOPP/index.html.
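As a complement to Section 3.1, the following MATLAB sketch builds $P_{2t}(\beta)$ explicitly and picks the best candidate step; the function name, the root filtering, and the interface are choices of this sketch, not the packaged implementation.

function beta = optimal_step(F, Qfull, S, b, t, mu)
% Qfull = [Qtilde, Qperp] is the m-by-m orthogonal completion, S the
% skew-symmetric direction of form (4), and f(Q) = F*vec(Q). The
% truncation order t is a user choice.
m = size(Qfull, 1);
n = size(F, 2) / m;                        % F is k-by-(m*n)
Imn = [eye(n); zeros(m - n, n)];
Fc = zeros(length(b), t + 1);
Sk = eye(m);
for i = 0:t                                % f_i = f(Qfull*(S^i/i!)*Imn)
    Fc(:, i + 1) = F * reshape(Qfull * (Sk / factorial(i)) * Imn, [], 1);
    Sk = Sk * S;
end
Fc(:, 1) = Fc(:, 1) - b;                   % f_0 = f(Qtilde) - b
p = zeros(1, 2*t + 1);                     % p(q+1) is the beta^q coefficient
for i = 0:t
    for j = 0:t
        p(i + j + 1) = p(i + j + 1) + Fc(:, i + 1)' * Fc(:, j + 1);
    end
end
dp = p(2:end) .* (1:2*t);                  % coefficients of dP/dbeta
r = roots(fliplr(dp));                     % roots() expects descending order
r = real(r(abs(imag(r)) < 1e-10 & real(r) > 0));
cand = [r(r < mu); mu];                    % candidate steps, endpoint included
P = @(x) polyval(fliplr(p), x);
[~, k] = min(arrayfun(P, cand));
beta = cand(k);
end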


5 Computational experiments

In this section the presented algorithm is tested on randomly generated problems of different dimensions, $Q \in \mathbb{R}^{m\times n}$ and $F \in \mathbb{R}^{k\times mn}$. We mainly investigate the ability to compute a minimizer and the number of iterations needed to do so, but we also present some results regarding the efficiency in computing the global minimum.

The matrix $F$ is generated as a matrix with normally distributed random numbers. By manipulating the singular values of $F$, different condition numbers can be chosen. A random solution $\hat{Q}$ is generated, and the exact model is then $\hat{b} = F\,\mathrm{vec}(\hat{Q})$. To generate $b$, let
$$b = \hat{b} + \gamma\bar{b},$$
where $\bar{b}$ is a perturbation and $\gamma > 0$ a scalar. Two different methods of choosing $\bar{b} \in \mathbb{R}^k$ have been considered.

1. Let each element $\bar{b}_i = \epsilon_i|\hat{b}_i|$, $i = 1, \ldots, k$, where $\epsilon_i$ is a scalar chosen randomly from the normal distribution.
2. Assume that $Q$ is parameterized with $p$ parameters. If $k > p$, we can compute the Jacobian at $f(\hat{Q})$ and let $N \in \mathbb{R}^{k\times(k-p)}$ be a basis of the null space of $J^T$. Now take $\bar{b} = \rho Nx$, where $x \in \mathbb{R}^{k-p}$ is a vector with normally distributed random numbers and $\rho$ is a scalar used to make $\|\bar{b}\|_2 = \|f(\hat{Q})\|_2$.

In Item 1, relative perturbations are generated. Using this type of perturbation changes the initially chosen solution $\hat{Q}$; that is, the generated $\hat{Q}$ is not a minimum (critical point) of the optimization problem. By using Item 2, the initial solution $\hat{Q}$ will always be a critical point with residual $\gamma\bar{b}$ (but not necessarily a minimum). Here $\gamma$ is used to make the norm of the residual proportional to the norm of $f(\hat{Q})$. For instance, using $\gamma = 0.1$ means that the magnitude of the residual is 10% of the magnitude of the function value $f(\hat{Q})$. For small values of $\gamma$, $\hat{Q}$ should still be a global minimum after adding the perturbation. Choosing too large a value of $\gamma$ often results in $\hat{Q}$ becoming a local minimum, saddle point or maximum. We want to add a perturbation small enough that $\hat{Q}$ is still the global minimum, but large enough to cause trouble; that is, not so small that the algorithm becomes 100% successful in computing the global (generated) minimum $\hat{Q}$.

5.1 Relative perturbations

The tables in Appendix C.1 display results for different dimensions $m$, $n$ and $F \in \mathbb{R}^{mn\times mn}$ using relative perturbations (Item 1 above). For a given condition number $\kappa(F)$ and noise level $\gamma$, 100 tests are randomly generated. The tables display the average number of iterations, with the corresponding standard deviation inside parentheses. For example, for $m = 3$ and $n = 2$ with $\kappa(F) = 5$ and $\gamma = 0.01$ (1% relative noise level), 100 tests were done; the average number of iterations needed to compute a solution was 3.77, with a standard deviation of 0.42.
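A MATLAB sketch of perturbation method 2 above (used in the next subsection); the function name is ours.

function bbar = nullspace_noise(J, fQhat)
% Noise in the null space of J', scaled so that ||bbar|| = ||f(Qhat)||.
N = null(J');                              % k-by-(k-p) null space basis
bbar = N * randn(size(N, 2), 1);
bbar = bbar * (norm(fQhat) / norm(bbar));
end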


5.2 Null space perturbations

For a given dimension $m$, $n$ and noise level $\gamma$, each table in Appendix C.2.1 corresponds to a set of tests where the condition number of $F \in \mathbb{R}^{mn\times mn}$ is varied. For each condition number, 100 test problems were generated with a perturbation according to Item 2 above. The tables contain the following information.

κ(F)        Condition number of the matrix $F$.
Iterations  The average number of iterations, with the corresponding standard deviation inside parentheses (computed after running 100 test problems).
Fails       The number of tests that resulted in a non-minimum solution due to exceeding 100 iterations. It is expected that some generated problems can yield very slow convergence; hence the algorithm was set to terminate after 100 iterations.
New min     For a given test problem generated with the exact solution $\hat{Q}$, let $\bar{Q}$ be the solution computed by the algorithm. The number of tests that resulted in $\hat{Q} \ne \bar{Q}$ is shown here. This was checked by testing whether $\|\hat{Q} - \bar{Q}\|_F > 10^{-4}$.
Not global  Shows the number of tests where $\|f(\hat{Q}) - b\|_2 < \|f(\bar{Q}) - b\|_2$ occurred; that is, the computed solution resulted in a greater residual norm than the generated solution.

The ideal results are, e.g., those shown in Table 14. The 'Fails' column with just zeroes indicates that the algorithm managed to compute a minimum for all test problems, and the 'New min' column indicates that the computed solution is the same as the generated solution. Table 13, e.g., also shows good results: here the computed solutions differ on several occasions from the generated solutions, as seen in the 'New min' column, yet only a few of the computed solutions resulted in a greater residual norm, as seen in the 'Not global' column.

5.2.1 Tests with non-square F

For $Q \in \mathbb{R}^{m\times n}$ and $F \in \mathbb{R}^{k\times mn}$, the computational experiments in the previous sections used $k = mn$ (so that $F$ is a square matrix). Here we consider the case $k < mn$, with perturbations according to Item 2. For the results, shown in Appendix C.2.2, $k = mn - n$ was used. As earlier, for each condition number $\kappa(F)$, 100 tests were made. The tables show the same information as described in Section 5.2.

6 Summary of computational experiments

The computational experiments presented in Appendix C show that the algorithm is efficient in computing a solution to (1).


The tables indicate that around 5-15 iterations were needed on average, depending on the problem dimension and noise level.

When it comes to the success rate of computing the global solution, the algorithm seems quite successful. First of all, no global optimization algorithm was used during the experiments. By using small $\gamma$ values and perturbations according to Item 2 above, $\hat{Q}$ should most often be a global minimizer. Typically this was the case when using $\gamma = 0.05$ and $\gamma = 0.1$, while for $\gamma = 0.2$, $\hat{Q}$ would more often become a local minimum, saddle point or maximum.

For the test problems generated in Appendix C.2.1, with $m = 6, 10, 12$, the algorithm most often computed a solution $\bar{Q}$ better than (in the sense that $\bar{Q}$ resulted in a smaller residual) or the same as $\hat{Q}$. Rather surprisingly, the worst results appear in the low-dimensional case with $m = 3$, where around 10% of the computed solutions yielded a greater residual norm than $\hat{Q}$. For all tests, using $\gamma = 0.2$ quite often resulted in $\hat{Q}$ not being a global minimum. By subtracting the 'Not global' column from the 'New min' column, the number of test problems where $\|f(\bar{Q}) - b\|_2 < \|f(\hat{Q}) - b\|_2$ is obtained. Typically, $\hat{Q}$ became a saddle point in most of these cases when the perturbation was added.

For a non-square $F$, in Appendix C.2.2, the tables show a noticeable increase in the average number of iterations. In the tables with $m = 6$ and $n = 5$, 10%-40% of the experiments resulted in a computed solution with a greater residual than $\hat{Q}$. However, in the tables with $m = 10$ and $n = 4$, many of the tests resulted in $\hat{Q}$ becoming a local minimum (or maximum/saddle point) after the perturbation was added, and the computed solution yielded a smaller residual norm. Even though these problems are of different dimensions, $6\times 5$ and $10\times 4$, it is not clear why the results vary this much.

Nevertheless, in total 41800 tests are presented here. Out of these, 5 resulted in the algorithm terminating because more than 100 iterations were performed (without fulfilling the desired tolerance). Since the algorithm uses Gauss-Newton steps unless the Hessian $J^T J + H$ is positive definite, this can in some cases (with large residuals) result in slow convergence, especially if the computed initial matrix $Q_0$ is a bad starting value. On the whole, however, for these tests it was a rare scenario.

Appendix

A The canonical form of a WOPP

Proposition A.1 The matrices $A \in \mathbb{R}^{m_A\times m}$ and $X \in \mathbb{R}^{n\times n_X}$ with $\mathrm{rank}(A) = m$ and $\mathrm{rank}(X) = n$ belonging to a WOPP
$$\min \frac{1}{2}\|AQX - B\|_F^2, \quad \text{subject to } Q^T Q = I_n,$$
can always be considered as $m$ by $m$ and $n$ by $n$ diagonal matrices, respectively.


Proof. Let A = U_A Σ_A V_A^T and X = U_X Σ_X V_X^T be the singular value decompositions of A and X. Then

$$\|U_A \Sigma_A V_A^T Q U_X \Sigma_X V_X^T - B\|_F^2 = \|U_A \Sigma_A Z \Sigma_X V_X^T - B\|_F^2,$$

where Z = V_A^T Q U_X ∈ R^{m×n} has orthonormal columns. Since U_A^T U_A = I_{m_A} and V_X^T V_X = I_{n_X}, it follows that

$$\|U_A \Sigma_A Z \Sigma_X V_X^T - B\|_F^2 = \mathrm{tr}\big((U_A \Sigma_A Z \Sigma_X V_X^T - B)^T (U_A \Sigma_A Z \Sigma_X V_X^T - B)\big)$$
$$= \mathrm{tr}\big(V_X \Sigma_X Z^T \Sigma_A^2 Z \Sigma_X V_X^T - 2 V_X \Sigma_X Z^T \Sigma_A U_A^T B + B^T B\big)$$
$$= \mathrm{tr}\big(\Sigma_X Z^T \Sigma_A^2 Z \Sigma_X\big) - 2\,\mathrm{tr}\big(\Sigma_X Z^T \Sigma_A U_A^T B V_X\big) + \mathrm{tr}\big(B^T B\big)$$
$$= \mathrm{tr}\big((\Sigma_A Z \Sigma_X - U_A^T B V_X)^T (\Sigma_A Z \Sigma_X - U_A^T B V_X)\big) = \|\Sigma_A Z \Sigma_X - U_A^T B V_X\|_F^2.$$

Hence, without loss of generality, we can assume that A = diag(α_1, ..., α_m) and X = diag(χ_1, ..., χ_n) with α_i ≥ α_{i+1} ≥ 0 and χ_i ≥ χ_{i+1} ≥ 0. ✷

B Parametrization of V_{m,n} by using the Cayley transform

The Cayley transform is often used to represent orthogonal matrices with positive determinant as

Q(S) = (I + S)(I − S)^{-1},    (11)

where S ∈ R^{m×m} is skew-symmetric, S = −S^T. Since a skew-symmetric matrix has purely imaginary eigenvalues, (I − S) always has full rank. However, this parametrization fails in some cases, namely when (Q̃ + I) is singular. As an example, there exists no S ∈ R^{2×2} such that Q(S) = diag(−1, −1). Instead of using (11) as a parametrization of orthogonal matrices, a local parametrization can be used. Given a point Q̃ ∈ V_{m,m}, we can express any Q ∈ V_{m,m} in the vicinity of Q̃ by using

Q(S) = Q̃(I + S)(I − S)^{-1}.    (12)

To get a local parametrization of V_{m,n} when n ≤ m, (12) is modified as follows. Given a point Q̃ ∈ V_{m,n}, a parametrization of any Q ∈ V_{m,n} in the vicinity of Q̃ can be written as

Q(S) = [Q̃, Q̃⊥](I + S)(I − S)^{-1} I_{m,n}.    (13)

Here Q̃⊥ is any extension such that [Q̃, Q̃⊥] ∈ R^{m×m} is orthogonal and

$$I_{m,n} = \begin{bmatrix} I_n \\ 0 \end{bmatrix} \in \mathbb{R}^{m \times n}.$$

S is skew-symmetric, structured as

$$S = \begin{bmatrix} S_{11} & -S_{21}^T \\ S_{21} & 0 \end{bmatrix}, \qquad (14)$$

where S_{11} ∈ R^{n×n} is skew-symmetric and S_{21} ∈ R^{(m−n)×n} is arbitrary; the remaining lower right block of S is a zero matrix. Observe that if m = n, then (13) is the same as (12).
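To make the construction concrete, here is a minimal NumPy sketch of the local parametrization (13)-(14). The choice of Q̃⊥ via a complete QR factorization is ours; any orthonormal extension of Q̃ works.

```python
import numpy as np

def cayley_point(Q_tilde, S11, S21):
    """Local Cayley parametrization (13) of V_{m,n} around Q_tilde.

    Q_tilde -- m-by-n matrix with orthonormal columns
    S11     -- n-by-n skew-symmetric block
    S21     -- (m-n)-by-n arbitrary block
    """
    m, n = Q_tilde.shape
    # Any orthonormal complement of Q_tilde; a complete QR provides one.
    Q_full, _ = np.linalg.qr(Q_tilde, mode='complete')
    Qm = np.hstack([Q_tilde, Q_full[:, n:]])   # [Q_tilde, Q_perp], m-by-m
    # Assemble the skew-symmetric S of (14), zero lower-right block.
    S = np.zeros((m, m))
    S[:n, :n] = S11
    S[n:, :n] = S21
    S[:n, n:] = -S21.T
    return Qm @ (np.eye(m) + S) @ np.linalg.inv(np.eye(m) - S) @ np.eye(m, n)

# quick feasibility check: the result stays on the Stiefel manifold
rng = np.random.default_rng(0)
Q0, _ = np.linalg.qr(rng.standard_normal((5, 3)))
S11 = np.triu(rng.standard_normal((3, 3)), 1)
S11 = S11 - S11.T
Q1 = cayley_point(Q0, S11, rng.standard_normal((2, 3)))
assert np.allclose(Q1.T @ Q1, np.eye(3))
```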


B.1 Search directions with Cayley representation

For a given search direction s (equivalently, its matrix form S) at a point Q̃, moving along the surface of f(Q) can be done by using

Q(φ) = [Q̃, Q̃⊥] C_φ(φ) I_{m,n},

where

$$C_\varphi(\varphi) = \sum_{j=1}^{\tilde p} U_j \begin{bmatrix} \cos(\varphi_j) & -\sin(\varphi_j) \\ \sin(\varphi_j) & \cos(\varphi_j) \end{bmatrix} U_j^T \qquad (15)$$

$$= \sum_{j=1}^{\tilde p} \Big( \cos(\varphi_j)\, U_j U_j^T + \sin(\varphi_j)\, U_j \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} U_j^T \Big). \qquad (16)$$

By using the spectral decomposition of S, S = WDW^H, the decomposition C_φ(φ) = U Φ(φ) U^T is derived [6]. U ∈ R^{m×m} is orthogonal,

$$\Phi(\varphi) = \begin{cases} \mathrm{diag}(\Phi_1, \Phi_2, \ldots, \Phi_{m/2}) & \text{if } m \text{ is even} \Rightarrow \tilde p = m/2, \\ \mathrm{diag}(\Phi_1, \Phi_2, \ldots, \Phi_{(m-1)/2}, 1) & \text{otherwise} \Rightarrow \tilde p = (m-1)/2 + 1, \end{cases}$$

where

$$\Phi_i = \begin{bmatrix} \cos(\varphi_i) & -\sin(\varphi_i) \\ \sin(\varphi_i) & \cos(\varphi_i) \end{bmatrix}.$$

Here U_j ∈ R^{m×2} denotes the pair of columns of U associated with the block Φ_j. Now using (16) to express C_φ(φ) yields

$$f(Q(\varphi)) = \sum_{j=1}^{\tilde p} \Big( \cos(\varphi_j)\, f\big([\tilde Q, \tilde Q_\perp] U_j U_j^T I_{m,n}\big) + \sin(\varphi_j)\, f\big([\tilde Q, \tilde Q_\perp] U_j \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} U_j^T I_{m,n}\big) \Big)$$

$$= \cos(\varphi_1) f_{1,\cos} + \sin(\varphi_1) f_{1,\sin} + \ldots + \cos(\varphi_{\tilde p}) f_{\tilde p,\cos} + \sin(\varphi_{\tilde p}) f_{\tilde p,\sin}.$$

The optimal C_φ(φ) is given by solving the least squares problem

$$\min_{\varphi} \Big\| A_{f_1} \begin{bmatrix} \cos\varphi_1 \\ \sin\varphi_1 \end{bmatrix} + A_{f_2} \begin{bmatrix} \cos\varphi_2 \\ \sin\varphi_2 \end{bmatrix} + \ldots + A_{f_{\tilde p}} \begin{bmatrix} \cos\varphi_{\tilde p} \\ \sin\varphi_{\tilde p} \end{bmatrix} - b \Big\|_2^2, \qquad (17)$$

where A_{f_i} = [f_{i,cos}, f_{i,sin}] ∈ R^{k×2}.

Two different approaches to solving this subproblem are considered. A traditional Gauss-Newton or Newton method can be used to solve (17). However, empirical studies have shown that the Jacobian matrix of f(Q(φ)) can occasionally become ill-conditioned, so a Gauss-Newton method can result in slow convergence, while switching to a Newton method might instead result in convergence towards a maximum. Since the parameters φ_i are periodic, a large search direction when solving (17) can result in a seemingly randomized step.
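As an illustration of the first approach, the following is a minimal sketch of a damped Gauss-Newton iteration for the angle subproblem (17). The blocks A_{f_i} are assumed to be given; the damping term is our addition, motivated by the ill-conditioning mentioned above, and no step-length control is included.

```python
import numpy as np

def gauss_newton_angles(A_blocks, b, phi0, n_iter=30, damping=1e-8):
    """Damped Gauss-Newton on the angle subproblem (17): a sketch.

    A_blocks -- list of the k-by-2 matrices A_{f_i}
    b        -- right-hand side in R^k
    phi0     -- initial angles, length p
    """
    phi = np.asarray(phi0, dtype=float).copy()
    for _ in range(n_iter):
        cs, sn = np.cos(phi), np.sin(phi)
        r = sum(A @ np.array([c, s])
                for A, c, s in zip(A_blocks, cs, sn)) - b
        # Jacobian column i: d r / d phi_i = A_i [-sin phi_i, cos phi_i]^T
        J = np.column_stack([A @ np.array([-s, c])
                             for A, c, s in zip(A_blocks, cs, sn)])
        # damped normal equations guard against ill conditioning
        step = np.linalg.solve(J.T @ J + damping * np.eye(len(phi)),
                               -J.T @ r)
        phi += step
    return phi
```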


In the cases when Newton-type algorithms fail, a coordinate-wise search can be used. This is done by keeping every angle but one, φ_i ∈ φ, fixed and minimizing over that angle, then repeating this for all angles φ_j ∈ φ, j = 1, ..., p̃.

Algorithm: Coordinate-wise search

0. Given a search direction s (S), compute U.
1. Set φ_1 = φ_2 = ... = φ_p̃ = 0.
2. While φ is not a minimizer of (17)
   2.1. for i = 1 to p̃
        2.1.1. Compute
               $$c = \sum_{j=1, j \neq i}^{\tilde p} A_{f_j} \begin{bmatrix} \cos\varphi_j \\ \sin\varphi_j \end{bmatrix}.$$
        2.1.2. Let φ_i be the solution of
               $$\min_{\varphi_i} \Big\| A_{f_i} \begin{bmatrix} \cos\varphi_i \\ \sin\varphi_i \end{bmatrix} - (b - c) \Big\|_2^2. \qquad (18)$$
   2.2. end for
2.3. end While

The subproblem (18) is solved optimally by computing all solutions of a fourth-degree polynomial, see [13]. This is a very robust method for minimizing (17), but for larger problems it can be a time-consuming task. It has also been shown to result in too short step lengths, and thereby slow convergence.
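A runnable sketch of the coordinate-wise search is given below. For brevity, the exact solution of (18) via the fourth-degree polynomial of [13] is replaced by a dense grid over each angle, so this approximates the subproblem solver; it is not the method of [13].

```python
import numpy as np

def coordinate_wise_search(A_blocks, b, n_sweeps=20, grid=720):
    """Coordinate-wise minimization of (17), a minimal sketch.

    A_blocks -- list of the k-by-2 matrices A_{f_i}
    b        -- right-hand side in R^k
    Each angle subproblem (18) is solved here on a dense grid.
    """
    p = len(A_blocks)
    phi = np.zeros(p)
    angles = np.linspace(0.0, 2 * np.pi, grid, endpoint=False)
    trig = np.stack([np.cos(angles), np.sin(angles)])  # 2-by-grid
    for _ in range(n_sweeps):
        for i, Ai in enumerate(A_blocks):
            # contribution c of all the other (fixed) angles
            c = sum(Aj @ np.array([np.cos(phi[j]), np.sin(phi[j])])
                    for j, Aj in enumerate(A_blocks) if j != i)
            r = b - c
            # residual norms for every candidate angle at once
            res = Ai @ trig - r[:, None]
            phi[i] = angles[np.argmin(np.sum(res**2, axis=0))]
    return phi
```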


C Results from computational experiments

C.1 Tables for the relative type of perturbations

κ(F)   γ = 0.001    γ = 0.01     γ = 0.1      γ = 0.2      γ = 0.5
2      3 (0)        3.13 (0.34)  3.77 (0.42)  3.92 (0.37)  4.18 (0.52)
5      3.15 (0.36)  3.77 (0.42)  4.21 (0.46)  4.4 (0.57)   4.73 (0.97)
10     3.34 (0.48)  3.93 (0.48)  4.46 (0.63)  4.69 (0.72)  4.91 (1.06)
50     3.69 (0.49)  4.13 (0.58)  5.13 (1.01)  5.15 (1.28)  5.18 (0.93)
100    3.77 (0.6)   4.25 (0.73)  5.2 (1.05)   5.39 (1.14)  5.14 (1.06)
250    3.72 (0.6)   4.74 (1.28)  5.21 (1.09)  5.24 (1.06)  5.58 (1.61)
500    3.92 (0.8)   4.87 (1.51)  5.4 (1.41)   5.76 (1.96)  5.15 (1.04)
1000   3.87 (0.77)  4.84 (1.22)  5.26 (1.3)   5.66 (1.36)  5.55 (1.31)
2500   4.15 (1.1)   5.14 (2.43)  5.27 (1.12)  5.53 (1.36)  5.35 (1.1)
5000   4.54 (1.53)  5.2 (1.37)   5.38 (1.43)  5.3 (1.12)   5.4 (1.56)
10000  4.6 (1.34)   5.17 (1.41)  5.29 (1.13)  5.63 (1.76)  5.52 (1.32)

Table 1: m = 3, n = 2.

κ(F)   γ = 0.001    γ = 0.01     γ = 0.1      γ = 0.2      γ = 0.5
2      3.05 (0.5)   3.29 (0.52)  4.00 (0)     4 (0)        4.27 (0.49)
5      3.19 (0.39)  4 (0)        4.4 (0.49)   4.94 (0.51)  5.46 (0.77)
10     3.71 (0.46)  4 (0)        4.95 (0.5)   5.35 (0.63)  6.07 (0.96)
50     3.98 (0.14)  4.44 (0.54)  5.66 (0.79)  6.26 (1.38)  7.07 (3.27)
100    4.02 (0.25)  4.75 (0.61)  5.79 (1.15)  6.32 (1.65)  7.13 (3.08)
250    4.1 (0.3)    4.99 (0.82)  5.88 (0.79)  6.32 (1.06)  6.95 (2.04)
500    4.11 (0.37)  5.12 (0.83)  5.93 (0.87)  6.59 (2.18)  7.33 (2.31)
1000   4.27 (0.57)  5.19 (0.85)  5.95 (1.91)  6.44 (1.56)  7.45 (3.34)
2500   4.59 (0.93)  5.18 (0.78)  5.81 (0.66)  6.28 (1.16)  7.11 (2.7)
5000   4.63 (0.86)  5.48 (1.18)  6.08 (1.35)  6.2 (1.06)   6.76 (1.24)
10000  4.92 (1.01)  5.29 (1)     5.89 (0.91)  6.33 (1.23)  6.89 (1.49)

Table 2: m = 6, n = 5.


κ(F)   γ = 0.001    γ = 0.01     γ = 0.1      γ = 0.2      γ = 0.5
2      3.07 (0.7)   3.4 (0.6)    4 (0)        4 (0)        4.17 (0.43)
5      3.41 (0.49)  4 (0)        4.72 (0.45)  5.01 (0.33)  5.58 (0.67)
10     3.99 (0.1)   4.02 (0.14)  5.3 (0.48)   5.83 (0.57)  6.19 (1.24)
50     4.01 (0.1)   4.98 (0.53)  6.3 (0.7)    6.51 (0.87)  7.32 (2.06)
100    4.07 (0.26)  5.28 (0.65)  6.41 (0.79)  6.7 (1.11)   7.47 (2.02)
250    4.32 (0.47)  5.48 (0.89)  6.44 (0.9)   7.01 (1.45)  6.86 (1.1)
500    4.33 (0.49)  5.74 (0.8)   6.54 (0.88)  6.79 (1.08)  7.11 (1.54)
1000   4.67 (0.74)  5.63 (0.79)  6.53 (0.81)  7.02 (2.16)  7.58 (2.77)
2500   4.73 (0.85)  5.87 (0.98)  6.64 (1.19)  6.95 (1.4)   7.44 (2.28)
5000   5.23 (1.04)  5.82 (0.98)  6.65 (0.89)  6.85 (1.1)   7.06 (1.51)
10000  5.16 (1.13)  5.87 (1.02)  6.6 (1.02)   6.73 (1.05)  7.29 (2.26)

Table 3: m = 12, n = 5.

κ(F)   γ = 0.001    γ = 0.01     γ = 0.1      γ = 0.2      γ = 0.5
2      3.04 (0.4)   3.53 (0.61)  4 (0)        4 (0)        4.36 (0.48)
5      3.29 (0.46)  4 (0)        4.7 (0.46)   5.03 (0.22)  6.15 (1.77)
10     3.99 (0.1)   4 (0)        5.09 (0.29)  5.58 (0.54)  6.39 (1.11)
50     4 (0)        4.69 (0.46)  5.97 (0.63)  6.41 (0.98)  8.65 (7.51)
100    4 (0)        4.95 (0.41)  6.01 (0.58)  6.64 (1.09)  7.75 (1.95)
250    4.08 (0.27)  5.2 (0.68)   6.18 (0.67)  6.6 (0.94)   7.8 (3.24)
500    4.19 (0.42)  5.48 (0.81)  6.26 (0.75)  6.39 (0.71)  7.74 (3.02)
1000   4.39 (0.62)  5.44 (0.72)  6.13 (0.65)  6.48 (0.9)   8.13 (5.31)
2500   4.64 (0.82)  5.62 (0.76)  6.25 (0.69)  6.6 (0.9)    7.59 (2.82)
5000   4.76 (0.81)  5.39 (0.72)  6.1 (0.73)   6.59 (0.94)  8.36 (4.15)
10000  5.09 (1)     5.51 (0.83)  6.07 (0.54)  6.41 (0.65)  8 (2.86)

Table 4: m = 10, n = 7.


C.2 Tables for null space perturbations

C.2.1 Using a square matrix F ∈ R^{mn×mn}

κ(F)   Iterations   Fails  New min  Not global
2      3.87 (0.33)  0      0        0
5      4.14 (0.40)  0      0        0
10     4.52 (0.78)  0      1        0
50     5.04 (0.97)  0      10       6
100    5.2 (1.11)   0      12       7
250    5.52 (1.47)  0      17       11
500    5.51 (1.39)  0      16       13
1000   5.44 (1.64)  0      19       14
2500   5.47 (1.1)   0      16       11
5000   5.58 (1.30)  0      20       16
10000  5.69 (1.48)  0      15       12

Table 5: m = 3, n = 2, γ = 0.05.

κ(F)   Iterations   Fails  New min  Not global
2      3.98 (0.14)  0      0        0
5      4.48 (0.59)  0      1        0
10     4.99 (0.85)  0      3        0
50     5.44 (1.18)  0      24       17
100    5.75 (2.16)  0      22       10
250    5.61 (1.18)  0      22       11
500    5.61 (1.19)  0      27       14
1000   5.47 (1.43)  0      26       11
2500   5.35 (1.09)  0      20       16
5000   5.66 (1.65)  0      17       13
10000  5.74 (1.34)  0      26       13

Table 6: m = 3, n = 2, γ = 0.1.


κ(F)   Iterations   Fails  New min  Not global
2      4.05 (0.3)   0      0        0
5      4.87 (0.93)  0      4        1
10     5.35 (0.99)  0      18       4
50     5.48 (1.34)  0      32       13
100    5.45 (1.24)  0      33       12
250    5.76 (1.56)  0      32       7
500    5.57 (1.17)  0      27       12
1000   5.93 (1.52)  0      35       15
2500   5.84 (1.42)  0      37       11
5000   5.73 (1.48)  0      32       9
10000  5.59 (1.78)  0      22       4

Table 7: m = 3, n = 2, γ = 0.2.

κ(F)   Iterations   Fails  New min  Not global
2      4 (0)        0      0        0
5      4.23 (0.42)  0      0        0
10     4.84 (0.37)  0      0        0
50     5.55 (0.69)  0      0        0
100    5.78 (0.91)  0      0        0
250    5.77 (0.96)  0      0        0
500    5.77 (0.81)  0      0        0
1000   5.85 (0.9)   0      1        1
2500   5.84 (1.1)   0      1        1
5000   5.83 (1.05)  0      0        0
10000  5.78 (0.81)  0      2        2

Table 8: m = 6, n = 5, γ = 0.05.

κ(F)   Iterations   Fails  New min  Not global
2      4 (0)        0      0        0
5      4.85 (0.41)  0      0        0
10     5.2 (0.45)   0      0        0
50     6.01 (0.72)  0      0        0
100    6.12 (1.1)   0      2        2
250    6.15 (1.31)  0      1        1
500    6.21 (0.86)  0      0        0
1000   6.17 (0.82)  0      1        0
2500   6.26 (1.13)  0      3        2
5000   5.98 (0.68)  0      2        1
10000  6.26 (1.57)  0      0        0

Table 9: m = 6, n = 5, γ = 0.1.


κ(F)   Iterations   Fails  New min  Not global
2      4.04 (0.2)   0      0        0
5      5.13 (0.49)  0      0        0
10     5.89 (0.97)  0      3        1
50     6.58 (1.22)  0      11       1
100    6.54 (1.06)  0      12       2
250    7.11 (1.79)  0      13       2
500    6.81 (2.67)  0      10       2
1000   6.93 (1.25)  0      7        4
2500   7.53 (2.78)  0      19       6
5000   7.07 (1.78)  0      17       8
10000  7.22 (1.89)  0      13       1

Table 10: m = 6, n = 5, γ = 0.2.

κ(F)   Iterations   Fails  New min  Not global
2      4 (0)        0      0        0
5      4.8 (0.4)    0      0        0
10     5.3 (0.5)    0      0        0
50     6.37 (0.95)  0      2        0
100    6.59 (1.2)   0      1        0
250    6.97 (1.7)   0      4        1
500    6.81 (1.15)  0      1        0
1000   6.68 (1.02)  0      6        2
2500   6.77 (1.04)  0      4        0
5000   6.96 (1.4)   0      4        0
10000  6.96 (1.3)   0      4        0

Table 11: m = 12, n = 5, γ = 0.05.

κ(F)   Iterations   Fails  New min  Not global
2      4 (0)        0      0        0
5      5.13 (0.37)  0      0        0
10     5.99 (0.58)  0      0        0
50     7.41 (2.35)  0      14       1
100    7.78 (2.4)   0      26       0
250    7.77 (2.73)  0      23       0
500    7.88 (1.82)  0      22       0
1000   7.77 (2.08)  0      25       1
2500   8.27 (3.05)  0      22       1
5000   8.01 (1.95)  0      27       1
10000  8.09 (2.61)  0      27       4

Table 12: m = 12, n = 5, γ = 0.1.


κ(F)   Iterations    Fails  New min  Not global
2      4.22 (0.42)   0      0        0
5      6.12 (1.27)   0      4        1
10     8.06 (3.64)   0      27       0
50     9.5 (4.24)    0      70       2
100    10.09 (7.04)  0      67       1
250    8.91 (2.87)   0      69       3
500    8.93 (2.09)   2      61       1
1000   9.34 (4.38)   0      63       0
2500   10.28 (6.75)  0      65       2
5000   10.16 (5.02)  0      72       1
10000  9.47 (4.73)   0      68       1

Table 13: m = 12, n = 5, γ = 0.2.

κ(F)   Iterations   Fails  New min  Not global
2      4 (0)        0      0        0
5      4.6 (0.49)   0      0        0
10     5.03 (0.17)  0      0        0
50     5.9 (0.61)   0      0        0
100    6.09 (0.64)  0      0        0
250    6.09 (0.6)   0      0        0
500    6.03 (0.64)  0      0        0
1000   6.14 (0.64)  0      0        0
2500   6.19 (0.54)  0      0        0
5000   6.05 (0.52)  0      0        0
10000  6.23 (0.58)  0      0        0

Table 14: m = 10, n = 5, γ = 0.05.

κ(F)   Iterations   Fails  New min  Not global
2      4 (0)        0      0        0
5      5.01 (0.1)   0      0        0
10     5.65 (0.61)  0      0        0
50     6.13 (0.56)  0      0        0
100    6.54 (0.86)  0      2        0
250    6.44 (0.88)  0      0        0
500    6.56 (0.81)  0      0        0
1000   6.63 (0.91)  0      0        0
2500   6.62 (0.83)  0      2        0
5000   6.45 (0.77)  0      1        0
10000  6.61 (1.38)  0      2        0

Table 15: m = 10, n = 5, γ = 0.1.


κ(F)   Iterations     Fails  New min  Not global
2      4.03 (0.17)    0      0        0
5      5.68 (0.55)    0      0        0
10     6.56 (0.98)    0      5        0
50     8.99 (4.06)    0      20       1
100    9.02 (5.49)    0      22       5
250    10.67 (10.24)  0      31       4
500    9.44 (9.4)     0      28       1
1000   8.48 (2.89)    0      33       0
2500   8.04 (2.13)    0      27       4
5000   9.24 (5.15)    0      26       2
10000  8.73 (3.52)    0      24       1

Table 16: m = 10, n = 5, γ = 0.2.

C.2.2 Using a non-square matrix F ∈ R^{(mn−n)×mn}

κ(F)   Iterations    Fails  New min  Not global
2      7.25 (1.48)   0      9        9
5      7.65 (1.47)   0      10       10
10     9.29 (4.78)   0      10       10
50     10.03 (3.25)  0      18       17
100    11.9 (4.44)   0      15       14
250    11.81 (4.08)  0      24       23
500    12.76 (3.95)  0      31       28
1000   13.34 (4.34)  0      36       34
2500   13.33 (3.83)  0      35       35
5000   14.08 (4.38)  0      25       24
10000  13.65 (3.88)  0      35       33

Table 17: m = 6, n = 5, γ = 0.05.


κ(F)   Iterations    Fails  New min  Not global
2      7.36 (1.84)   0      8        8
5      8.86 (2.78)   0      8        8
10     9.67 (3.6)    0      15       11
50     10.7 (3.34)   0      39       28
100    12.44 (4.53)  0      35       24
250    13.47 (4.39)  0      43       33
500    13.82 (4.56)  0      43       33
1000   14.72 (7.26)  0      36       29
2500   14.52 (5.25)  0      47       35
5000   14.48 (5.23)  0      41       35
10000  14.94 (4.93)  0      42       31

Table 18: m = 6, n = 5, γ = 0.1.

κ(F)   Iterations    Fails  New min  Not global
2      8.01 (2.63)   0      10       7
5      9.24 (2.79)   0      26       10
10     10.11 (3.2)   0      46       19
50     12.18 (4.93)  1      60       26
100    13.21 (3.81)  0      58       33
250    14.91 (7.41)  0      71       36
500    13.16 (5.52)  0      67       29
1000   12.92 (3.52)  0      62       27
2500   14.73 (5.48)  0      58       32
5000   15.05 (8.87)  0      67       38
10000  14.47 (5.44)  0      61       32

Table 19: m = 6, n = 5, γ = 0.2.

κ(F)   Iterations    Fails  New min  Not global
2      7.35 (1.45)   0      2        2
5      8.21 (1.47)   0      4        0
10     8.92 (1.79)   0      14       3
50     11.57 (3.21)  0      42       9
100    12.47 (3.25)  0      53       7
250    14.38 (4.49)  0      50       11
500    13.81 (3.06)  0      40       4
1000   15.37 (4.99)  0      39       8
2500   14.77 (3.54)  0      38       4
5000   14.79 (3.87)  0      49       9
10000  15.18 (4.99)  0      45       9

Table 20: m = 10, n = 4, γ = 0.05.


κ(F)   Iterations    Fails  New min  Not global
2      7.55 (1.36)   0      19       2
5      8.96 (5.19)   0      29       1
10     9.54 (3.19)   0      46       6
50     11.49 (2.49)  0      62       2
100    13.09 (3.87)  0      55       3
250    13.72 (5.38)  0      68       4
500    14.2 (3.53)   0      59       6
1000   14.15 (3.66)  0      69       6
2500   14.6 (4.3)    0      74       9
5000   14.38 (4.37)  0      59       4
10000  14.78 (3.49)  0      74       7

Table 21: m = 10, n = 4, γ = 0.1.

κ(F)   Iterations    Fails  New min  Not global
2      8.63 (3.83)   0      60       2
5      9.39 (3.44)   0      70       2
10     10.29 (3.26)  0      85       5
50     12.42 (3.53)  0      79       4
100    13.61 (4.69)  0      79       0
250    14.38 (7.06)  0      88       3
500    13.64 (3.12)  0      81       3
1000   15.04 (4.18)  1      85       2
2500   14.79 (6.26)  0      85       3
5000   15.51 (9.29)  0      81       2
10000  13.62 (3.09)  1      78       2

Table 22: m = 10, n = 4, γ = 0.2.

References

[1] M. T. Chu and N. T. Trendafilov. On a Differential Equation Approach to the Weighted Orthogonal Procrustes Problem. Statistics and Computing, 8(2):125–133, 1998.

[2] M. T. Chu and N. T. Trendafilov. The Orthogonally Constrained Regression Revisited. J. Comput. Graph. Stat., 10:746–771, 2001.

[3] A. Edelman, T. A. Arias, and S. T. Smith. The Geometry of Algorithms with Orthogonality Constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.

[4] L. Eldén and H. Park. A Procrustes problem on the Stiefel manifold. Numer. Math., 82(4):599–619, 1999.

[5] W. Gander. Least Squares with a Quadratic Constraint. Numer. Math., 36:291–307, 1981.


[6] P. R. Halmos. Finite-Dimensional Vector Spaces. Van Nostrand, 1958.

[7] M. A. Koschat and D. F. Swayne. A Weighted Procrustes Criterion. Psychometrika, 56(2):229–239, 1991.

[8] A. Mooijaart and J. J. F. Commandeur. A General Solution of the Weighted Orthonormal Procrustes Problem. Psychometrika, 55(4):657–663, 1990.

[9] T. Rapcsak. On Minimization on Stiefel Manifolds. European J. Oper. Res., 143(2):365–376, 2002.

[10] I. Söderkvist. Some Numerical Methods for Kinematical Analysis. ISSN-0348-0542, UMINF-186.90, Department of Computing Science, Umeå University, 1990.

[11] I. Söderkvist and P. Å. Wedin. On Condition Numbers and Algorithms for Determining a Rigid Body Movement. BIT, 34:424–436, 1994.

[12] E. Stiefel. Richtungsfelder und Fernparallelismus in n-dimensionalen Mannigfaltigkeiten. Commentarii Math. Helvetici, 8:305–353, 1935–1936.

[13] P. Å. Wedin and T. Viklands. Algorithms for 3-dimensional Weighted Orthogonal Procrustes Problems. Technical Report UMINF-06.06, Department of Computing Science, Umeå University, Umeå, Sweden, 2006.




Paper III

On the Number of Minima to Weighted Orthogonal Procrustes Problems*

Thomas Viklands†
Department of Computing Science, Umeå University
SE-901 87 Umeå, Sweden.
viklands@cs.umu.se

Abstract

A weighted orthogonal Procrustes problem (WOPP) min ||AQX − B||_F^2, subject to Q^T Q = I_n, where Q ∈ R^{m×n} with n ≤ m, can have several local minima. Hence some global optimization technique is often needed in order to find the global minimum. This contribution investigates the maximal number of minima of a WOPP, useful knowledge when developing a global optimization algorithm. A natural first approach is to study the case B = 0. It turns out that if A and X have strictly decreasing singular values, there exist exactly 2^n minima. By a continuity argument it is shown that the number of minima is conserved for small perturbations B = 0 + δB. Our conjecture is that no more than 2^n minima exist for a WOPP.

Keywords: Weighted, orthogonal, Procrustes, global minimum, Stiefel manifold, minima.

* From UMINF-06.08, 2006. Submitted to BIT.
† Financial support has partly been provided by the Swedish Foundation for Strategic Research under the frame program grant A3 02:128.


Contents

1 Introduction
2 2-norm formulation
3 The tangent space of V_{m,n}
4 Lagrangian formulation
5 Why study the case when B = 0
  5.1 The ellipsoid cases
  5.2 Motivation of B = 0 in general cases
6 The B = 0 case
  6.1 First order conditions and the critical points
  6.2 Second order conditions and the minimum solutions
  6.3 Some cases with equal singular values
7 Discussion of the general case B ≠ 0
  7.1 Some examples
  7.2 A simple algorithm
8 Concluding remarks
A The canonical form of a WOPP
B The solution to an OPP
C Parametrization of V_{m,n} by using the Cayley transform
  C.1 The tangent space of V_{m,n}
D Number of minima to the ellipsoid cases
References


1 Introduction

A weighted orthogonal Procrustes problem (WOPP) is an optimization problem that arises in applications related to, e.g., multivariate analysis and multidimensional scaling [5, 12, 13], and photogrammetry [1]. Typically it is about computing an optimal rotation when one set of data is to be matched to another. Formally, a WOPP corresponds to computing a matrix Q ∈ R^{m×n}, where n ≤ m, with orthonormal columns that solves the minimization problem

min (1/2)||AQX − B||_F^2 , subject to Q^T Q = I_n.    (1)

Here A ∈ R^{m×m}, X ∈ R^{n×n} and B ∈ R^{m×n} are known matrices and ||·||_F denotes the Frobenius norm. We can assume that A and X are square diagonal matrices A = diag(α_1, ..., α_m) and X = diag(χ_1, ..., χ_n), where α_i ≥ α_{i+1} > 0 and χ_i ≥ χ_{i+1} > 0, respectively; see Appendix A. We call this the canonical form of a WOPP. From now on we assume that A and X are diagonal matrices on this form.

Equation (1) is an optimization problem defined on the Stiefel manifold [19],

V_{m,n} = {Q ∈ R^{m×n} : Q^T Q = I_n}.

As in [8], we call (1) balanced if m = n and unbalanced if n < m. With A = I_m, (1) specializes to the orthogonal Procrustes problem (OPP)

min (1/2)||QX − B||_F^2 , subject to Q^T Q = I_n.    (2)

If B has full rank, this problem has a unique minimum that can be derived from the singular value decomposition of XB^T; see Appendix B.

Consider a weighting of the residual QX − B of an OPP as A(QX − B); then (2) becomes

min (1/2)||A(QX − B)||_F^2 , subject to Q^T Q = I_n.    (3)

By taking B := AB we get the optimization problem on the form given in (1).

Generally, a solution to (1) cannot be computed as easily as in the OPP case; an iterative method is needed. Earlier work in connection with iterative algorithms and methods for solving problems similar to (1) is reported in [3, 4, 6, 8, 10, 14–18, 22]. Moreover, (1) can have several minima, as also observed by others [3, 4, 8, 10, 14, 15].

This paper investigates the maximal number of minima of a WOPP. To do this, the special case B = 0 is studied in detail by using a Lagrangian formulation of (1), and at the end some special and low-dimensional cases with B ≠ 0 are considered. We start with some introductory definitions, formulations and motivations.


2 2-norm formulation

In later sections, we mainly consider the function AQX ∈ R^{m×n} embedded in R^{mn}. Usually this is done by using the vec-operator, the stacking of the columns of a matrix into a column vector. For example, with Q = [q_1, ..., q_n], Q ∈ R^{m×n},

$$\mathrm{vec}(Q) = \begin{bmatrix} q_1 \\ \vdots \\ q_n \end{bmatrix} \in \mathbb{R}^{mn}.$$

An equivalent formulation of (1), now in the 2-norm, is

min (1/2)||F vec(Q) − vec(B)||_2^2 , subject to Q^T Q = I.    (4)

The diagonal matrix F ∈ R^{mn×mn} is the Kronecker product of X^T and A, i.e., F = X^T ⊗ A = diag(χ_1 A, ..., χ_n A). More information regarding this problem formulation, with algorithms, can be found in [22] and [21]. The identity F vec(Q) = vec(AQX) behind (4) is illustrated in the sketch at the end of this section.

The surface of F vec(Q) is addressed later on when considering some special cases of a WOPP.

Definition 2.1 Let

F = {y = F vec(Q) | Q ∈ V_{m,n}}

denote the surface of F vec(Q) ∈ R^{mn}.
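A small NumPy sketch of the Kronecker identity behind (4), using random (non-canonical) A and X for generality:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.standard_normal((m, m))
X = rng.standard_normal((n, n))
Q, _ = np.linalg.qr(rng.standard_normal((m, n)))  # a point on V_{m,n}

F = np.kron(X.T, A)                    # F = X^T (x) A, of size mn-by-mn
lhs = F @ Q.flatten(order='F')         # F vec(Q), column-stacking vec
rhs = (A @ Q @ X).flatten(order='F')   # vec(AQX)
assert np.allclose(lhs, rhs)
```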


3 The tangent space of V_{m,n}

Given a parametrization of the Stiefel manifold, the tangent space of V_{m,n} at a point Q̃ is the set of all tangent directions. It is used in the following sections when classifying critical points of a WOPP.

Definition 3.1 The tangent space of the Stiefel manifold V_{m,n} at a given point Q̃ can be expressed as

T = {T = Q̃S + (I − Q̃Q̃^T)C},    (5)

where S = −S^T ∈ R^{n×n} is skew-symmetric and C ∈ R^{m×n} is arbitrary.

To derive the expression for the tangent space, the Cayley transform of a skew-symmetric matrix can be used; see Appendix C.1. For more information regarding the tangent space of V_{m,n}, see also [6].

4 Lagrangian formulation

For later analysis, we use the Lagrangian formulation of (4),

$$L(Q, \Lambda) = \frac12\|F\,\mathrm{vec}(Q) - \mathrm{vec}(B)\|_2^2 + \frac12\sum_{i=1}^{n} \lambda_{i,i}(q_i^T q_i - 1) + \sum_{i=1}^{n}\sum_{j<i} \lambda_{i,j}\, q_i^T q_j.$$

Here Λ denotes the set of all Lagrange parameters λ corresponding to the constraints Q^T Q = I. Λ can be considered as an n by n symmetric matrix with elements Λ_{i,j} = λ_{i,j}.

The gradient of the Lagrangian with respect to Q is denoted

$$\nabla_Q L = \begin{bmatrix} \nabla_{q_1} L \\ \vdots \\ \nabla_{q_n} L \end{bmatrix}$$

and can be written as n sets of m equations,

$$\nabla_{q_i} L = \chi_i^2 D q_i + \lambda_{i,i} q_i + \sum_{j \neq i} \lambda_{j,i} q_j - \chi_i A^T b_i, \quad i = 1, \ldots, n, \qquad (6)$$

where D = A^T A = diag(D_1, ..., D_m) with D_i = α_i², and b_i denotes the ith column of B. At a critical point (Q, Λ), the first order necessary condition

∇_Q L = 0    (7)

is fulfilled. The second order necessary condition for a minimum is

t^T H t ≥ 0 for all t = vec(T), T ∈ T,    (8)

where H = ∇²_{QQ} L is the Hessian of the Lagrangian with respect to Q and T is the tangent space of Definition 3.1; if the inequality in (8) is strict for all t ≠ 0, the condition is sufficient.

5 Why study the case when B = 0


5.1 The ellipsoid cases

Consider the special case Q ∈ R^{m×1}, studied by Forsythe and Golub [9], Gander [10] and Eldén [7], commonly written as

min ||Aq − b||_2^2 , subject to q^T q = 1.    (10)

In a geometric sense, (10) corresponds to determining the minimum distance between a hyper-ellipsoid in R^m, determined by A, and the given point b.

Let m = 2. Using the parametrization q = [cos φ, sin φ]^T, an equivalent formulation of (10) is

$$\min_\varphi \Big\| \begin{bmatrix} \alpha_1 \cos\varphi \\ \alpha_2 \sin\varphi \end{bmatrix} - \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} \Big\|_2^2.$$

Assume that α_1 > α_2. The optimization problem then corresponds to finding the point on the ellipse

x = α_1 cos φ, y = α_2 sin φ

that lies closest to the point b = [b_1, b_2]^T. There can be at most two minima for this problem. It turns out that if b is inside the evolute¹ of the ellipse,

x_e = ((α_1² − α_2²)/α_1) cos³ φ,
y_e = ((α_2² − α_1²)/α_2) sin³ φ,

then (10) has two minima; otherwise it has just one. (The sketch below checks this numerically.)

Figure 1: An ellipse with α_1 = 2 and α_2 = 1 and its evolute.

The global minimum is always in the same quadrant as b, while the local minimum is in the quadrant vertically opposite to b (due to α_1 > α_2).

¹ The evolute is the locus of the centers of curvature of a curve.
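A quick numerical check of the two-minima criterion; this coarse grid scan is ours and only approximate:

```python
import numpy as np

def count_minima(alpha1, alpha2, b, grid=4000):
    """Count local minima of ||Aq - b||_2, q on the unit circle,
    by scanning the angle parametrization."""
    phi = np.linspace(0.0, 2.0 * np.pi, grid, endpoint=False)
    d = (alpha1 * np.cos(phi) - b[0])**2 + (alpha2 * np.sin(phi) - b[1])**2
    # local minima on the periodic grid
    return int(np.sum((d < np.roll(d, 1)) & (d < np.roll(d, -1))))

print(count_minima(2.0, 1.0, np.array([0.5, 0.0])))  # inside the evolute: 2
print(count_minima(2.0, 1.0, np.array([5.0, 0.0])))  # outside the evolute: 1
```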


In particular, if b = 0, then the two minima are q = [0, ±1]^T, no matter the values of α_1 and α_2 (as long as α_1 > α_2 holds).

If α_1 = α_2 and b ≠ 0, then the solution q̂ of (10) is unique, q̂ = b/||b||_2. However, if α_1 = α_2 and b = 0, then there is a continuum of solutions (connected minima): any unit vector q ∈ R² is a minimizer (and maximizer).

Connected minima can occur in any dimension, but we focus on the cases yielding distinct minima. Distinct minima always occur if the singular values of A are strictly decreasing, α_i > α_{i+1} > 0 for all i = 1, ..., m − 1. In Section 6.3, cases with equal singular values are studied.

Deriving a similar "evolute surface" for general ellipsoid cases, when q ∈ R^{m×1}, m = 3, 4, ..., is more complicated and not really so interesting. What is interesting is that the origin, the point b = 0, is inside these surfaces. That is, when b = 0 the optimization problem (10) always has the two minimizers q = [0, ..., 0, ±1]^T. The maximal number of minima for an ellipsoid case is two, as stated by Theorem D.1 in Appendix D.

Related to the ellipsoid cases is the oblique Procrustes problem [5, 13], commonly formulated as

min (1/2)||AQ − B||_F^2 , subject to diag(Q^T Q) = [1, ..., 1].    (11)

Since there are no orthogonality constraints q_i^T q_j = 0, this problem is separable. Writing B = [b_1, ..., b_n], we have

||AQ − B||_F^2 = ||Aq_1 − b_1||_2^2 + ... + ||Aq_n − b_n||_2^2.

Problem (11) can then be written as n optimization problems of the form (10),

min (1/2)||Aq_1 − b_1||_2^2 , subject to q_1^T q_1 = 1,
⋮
min (1/2)||Aq_n − b_n||_2^2 , subject to q_n^T q_n = 1.    (12)

Each of the n optimization problems in (12) can have two minimizers. Hence, the maximal number of minima of problem (11) is 2^n.

5.2 Motivation of B = 0 in general cases

The difficulty in analyzing how many minima a WOPP might have is that one would need to know a B that results in the maximal number of minimizers. At this point, analysis with an arbitrary B seems to border on the task of solving the optimization problem analytically, as one can for an OPP. However, for the ellipsoid cases Q ∈ R^{m×1}, taking B = 0 always results in the maximal number of minima. Does the same hold for Q ∈ R^{m×n} with n > 1? Procrustes-type problems have elliptic properties, so studying the case B = 0 (or B in the vicinity of the origin) should yield valuable information.


The elliptic properties in this case are that we can consider F as the surface traced out by the ellipses given by plane rotations around each unit axis in R^m. As an example, for the ellipsoid case Q ∈ R^{3×1}, we can regard F as a surface of ellipses, where the plane spanned by any ellipse is parallel to either the xy-plane, the xz-plane or the yz-plane. Take a parametrization as, e.g.,

$$Q(\varphi_1, \varphi_2) = \begin{bmatrix} \cos\varphi_1 & -\sin\varphi_1 & 0 \\ \sin\varphi_1 & \cos\varphi_1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \cos\varphi_2 & 0 & -\sin\varphi_2 \\ 0 & 1 & 0 \\ \sin\varphi_2 & 0 & \cos\varphi_2 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}.$$

We can extend the concept of the evolute in R² to ellipses in R^{mn}. Any of these ellipses given by plane rotations, for any Q ∈ R^{m×n}, can be written as

$$E \begin{bmatrix} \cos\varphi \\ \sin\varphi \end{bmatrix} + d,$$

where E ∈ R^{mn×2} and d is a translation of the ellipse along a direction orthogonal to the plane spanned by the ellipse, i.e., d ⟂ Range(E). By using the SVD UΣV^T = E, the minimization

$$\min_\varphi \Big\| E \begin{bmatrix} \cos\varphi \\ \sin\varphi \end{bmatrix} + d - b \Big\|_2^2$$

is equivalent to

$$\min_\varphi \Big\| \Sigma \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} + \tilde d - \tilde b \Big\|_2^2 = \min_\varphi \Big\| \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix} \begin{bmatrix} \cos\theta \\ \sin\theta \end{bmatrix} - \begin{bmatrix} \tilde b_1 \\ \tilde b_2 \end{bmatrix} \Big\|_2^2 + \Big\| \begin{bmatrix} \tilde d_3 - \tilde b_3 \\ \vdots \\ \tilde d_{mn} - \tilde b_{mn} \end{bmatrix} \Big\|_2^2, \qquad (13)$$

where [cos θ, sin θ]^T = V^T [cos φ, sin φ]^T, d̃ = U^T d = [0, 0, d̃_3, ..., d̃_{mn}]^T and b̃ = U^T b. Whether (13) has two minima is determined by [b̃_1, b̃_2]^T; the remaining mn − 2 elements of b̃ do not affect this at all. Hence, by using the evolute for the R² case, we can define a similar function

$$\tilde h(\theta, h_3, \ldots, h_{mn}) = \begin{bmatrix} \frac{\alpha_1^2 - \alpha_2^2}{\alpha_1} \cos^3\theta \\ \frac{\alpha_2^2 - \alpha_1^2}{\alpha_2} \sin^3\theta \\ h_3 \\ \vdots \\ h_{mn} \end{bmatrix}.$$

The surface of h̃ is the boundary of the set of all b̃ such that (13) has the maximal number of minimizers. A reasonable assumption is that if a given point b = vec(B) (after the SVD rotation U^T b) lies in each of these sets, one for every ellipse, the maximal number of minima occurs. One point that fulfills this is the origin B = 0, since then b̃ = U^T b = 0 for all ellipses.
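The reduction (13) is easy to verify numerically. In the following sketch (our construction, with arbitrary random data), an ellipse in R^k with d ⟂ Range(E) is rotated by U^T, after which only the first two coordinates depend on the angle:

```python
import numpy as np

rng = np.random.default_rng(3)
k = 8                                  # ambient dimension (mn in the text)
E = rng.standard_normal((k, 2))
d = rng.standard_normal(k)
d -= E @ np.linalg.lstsq(E, d, rcond=None)[0]   # enforce d orthogonal to Range(E)
b = rng.standard_normal(k)

U, s, Vt = np.linalg.svd(E)            # E = U Sigma V^T
b_t, d_t = U.T @ b, U.T @ d            # rotated data; d_t[:2] is (numerically) zero

phi = 0.7
u = np.array([np.cos(phi), np.sin(phi)])
w = Vt @ u                             # [cos(theta), sin(theta)]^T = V^T u
full = np.linalg.norm(E @ u + d - b)**2
reduced = (np.linalg.norm(s * w - b_t[:2])**2
           + np.linalg.norm(d_t[2:] - b_t[2:])**2)
assert np.isclose(full, reduced)       # the identity behind (13)
```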


6 The B = 0 case

In this section, we study the special case B = 0. The practical relevance of B = 0 is perhaps insignificant. However, when studying how many minima a WOPP can have, it is an intuitive first approach. We derive all critical points of the optimization problem and classify which are minima, maxima and inflection points. Under some conditions on A and X, we show that there are 2^n minima. Earlier studies of the number of minima of optimization problems on Stiefel manifolds, similar to a WOPP with B = 0, have been done in [2]. For the problem type considered in [2], the number of minima is also 2^n.

6.1 First order conditions and the critical points

We now consider a WOPP with B = 0,

min (1/2)||AQX||_F^2 , subject to Q^T Q = I_n
⇓
min (1/2)||F vec(Q)||_2^2 , subject to Q^T Q = I_n.    (14)

Additionally, we assume that the diagonal elements (singular values) of A and X are strictly decreasing, i.e.,

α_i > α_{i+1} > 0 ⇒ D_i > D_{i+1}    (15)

and

χ_i > χ_{i+1} > 0.    (16)

The reason for this assumption is that the WOPP with B = 0 will then not have connected minima. Later on, in Section 6.3, we study some cases with equal singular values (α_i = α_{i+1} or χ_i = χ_{i+1} for at least one i).

Theorem 6.1 Any critical point of (14) has only 0, 1 and/or −1 as elements.

Proof. The Lagrangian corresponding to the problem is

$$L(Q, \Lambda) = \frac12\|F\,\mathrm{vec}(Q)\|_2^2 + \frac12\sum_{i=1}^{n} \lambda_{i,i}(q_i^T q_i - 1) + \sum_{i=1}^{n}\sum_{j<i} \lambda_{i,j}\, q_i^T q_j. \qquad (17)$$

According to (6), a stationary point results in n sets of m equations,

$$\nabla_{q_i} L = \chi_i^2 D q_i + \lambda_{i,i} q_i + \sum_{j \neq i} \lambda_{j,i} q_j = 0, \quad i = 1, \ldots, n. \qquad (18)$$

Assume that λ_{i,j} ≠ 0 for some pair i ≠ j, and take the two corresponding equations from (18),

$$\chi_i^2 D q_i + \lambda_{i,i} q_i + \lambda_{1,i} q_1 + \ldots + \lambda_{j,i} q_j + \ldots + \lambda_{n,i} q_n = 0, \qquad (19)$$


$$\chi_j^2 D q_j + \lambda_{j,j} q_j + \lambda_{1,j} q_1 + \ldots + \lambda_{i,j} q_i + \ldots + \lambda_{j,n} q_n = 0. \qquad (20)$$

Due to orthogonality, multiplying (19) by q_j^T and (20) by q_i^T gives

χ_i² q_j^T D q_i + λ_{i,j} = χ_i² γ + λ_{i,j} = 0,    (21)
χ_j² q_i^T D q_j + λ_{i,j} = χ_j² γ + λ_{i,j} = 0,    (22)

where γ = q_j^T D q_i. Subtracting (21) from (22) gives (χ_i² − χ_j²)γ = 0, so the condition (16) implies that γ = 0, yielding λ_{i,j} = 0.

The set of equations (18) then has the form of an eigenvalue problem,

$$\chi_1^2 D q_1 = -\lambda_{1,1} q_1, \quad \chi_2^2 D q_2 = -\lambda_{2,2} q_2, \quad \ldots, \quad \chi_n^2 D q_n = -\lambda_{n,n} q_n. \qquad (23)$$

Hence each λ_{i,i} must equal some −χ_i² D_j, j = 1, ..., m, since D is a diagonal matrix. Consequently q_i = ±e_j, where e_j denotes a column vector of I ∈ R^{m×m}. Q is orthogonal, so clearly if q_i = ±e_j then any other column of Q fulfills q_k = ±e_l with l ≠ j. That is, if q_1 = ±e_i then q_2 = ±e_j, q_3 = ±e_k, and so on. ✷

Now we know all critical points of (14). What remains is to classify which of them are minima.

6.2 Second order conditions and the minimum solutions

Theorem 6.2 A problem of the form (14) has 2^n minima, and each minimum is of the form

$$\hat Q = \begin{bmatrix} Z \\ K \end{bmatrix},$$

where Z is an (m − n) by n zero matrix and K is an n by n anti-diagonal matrix with arbitrary ±1 elements. Additionally, the minimum value ||AQ̂_i X||_F^2 is the same for all minima Q̂_i, i = 1, ..., 2^n.

To prove Theorem 6.2, we make use of the following lemmas. Each lemma excludes forms of Q that result in non-minimum critical points. When proving the lemmas, we use the necessary condition (8). An important observation is that at a critical point, the Hessian is a diagonal matrix,

H = diag(χ_1² D + λ_{1,1} I, ..., χ_n² D + λ_{n,n} I),

since λ_{i,j} = 0 whenever i ≠ j.

Lemma 6.1 A minimum Q̂ cannot be of the form

$$\hat Q = \begin{bmatrix} U \\ z^T \\ V \end{bmatrix},$$


where z^T is a row of zeros and U ∈ R^{p×n} has at least one row containing a 1 or −1.

Proof. Assume that there is at least one element of U equal to ±1, say U_{i,j} = ±1. When choosing a tangent direction from Definition 3.1, take S = 0, so that T = (I − Q̂Q̂^T)C with

$$I - \hat Q \hat Q^T = \begin{bmatrix} I - UU^T & 0 & -UV^T \\ 0 & 1 & 0 \\ -VU^T & 0 & I - VV^T \end{bmatrix}.$$

Let k = p + 1 and choose all elements of C as zero apart from C_{k,j} = c. Then T = C, so t = vec(T) has the element t_{m(j−1)+k} = c and zeros elsewhere. Denoting the jth column of T by T_j, the condition (8) becomes

t^T H t = T_j^T (χ_j² D + λ_{j,j} I) T_j = c² (χ_j² D_k + λ_{j,j}).

Looking back at (23), we see that if U_{i,j} = ±1 then q_j = ±e_i, so λ_{j,j} = −χ_j² D_i. But since i ≤ p and k > p ⇒ k > i, (15) gives D_k − D_i < 0, so t^T H t = c² χ_j² (D_k − D_i) < 0. We have shown that whenever there is an element equal to ±1 in some row i of Q̂ and there is a row k with k > i containing just zeros, it is possible to find a tangent direction t resulting in t^T H t < 0. Hence Q̂ cannot be a minimizer. ✷

If Q̂ is to be a minimizer, all m − n zero rows must thus be at the top of the matrix. What is left to show is that the remaining n rows, containing the ±1 elements, must be ordered to form an n by n anti-diagonal matrix.

Lemma 6.2 If the element Q̂_{m,1} = 0 then Q̂ is not a minimizer.

Proof. Assume that

$$\hat Q = \begin{bmatrix} Z \\ P \end{bmatrix},$$

where P ∈ R^{n×n} is orthogonal and Z is a zero matrix. Also assume that Q̂_{m,1} ≠ ±1. This results in Q̂_{i,1} = ±1 for one i ∈ {m − n + 1, ..., m − 1} and Q̂_{m,j} = ±1 for one j ∈ {2, ..., n}. This means that there is a ±1 element in row i of the first column, with (m − n) < i < m, and a ±1 element in column j of the last row m, with j > 1.

As tangent direction, take C = 0 and choose the skew-symmetric matrix S with S_{j,1} = s, S_{1,j} = −s and zeros elsewhere. Then T = Q̂S has zero elements everywhere apart from T_{m,1} = ±s and T_{i,j} = ±(−s). Let T_1 and T_j, respectively, be the columns of T containing these nonzero elements. The condition (8) is then

t^T H t = T_1^T(χ_1² D + λ_{1,1} I)T_1 + T_j^T(χ_j² D + λ_{j,j} I)T_j = s²(χ_1² D_m + λ_{1,1}) + s²(χ_j² D_i + λ_{j,j}).    (24)


Now q_1 = ±e_i and q_j = ±e_m, so λ_{1,1} = −χ_1² D_i and λ_{j,j} = −χ_j² D_m. Substituting this into (24) yields

t^T H t = s²(χ_1² D_m − χ_1² D_i + χ_j² D_i − χ_j² D_m) = s²(χ_1² − χ_j²)(D_m − D_i) < 0,

since (χ_1² − χ_j²) > 0 and (D_m − D_i) < 0 by (16) and (15), respectively.

We have shown that if Q̂_{m,1} ≠ ±1, we can always find a tangent direction t = vec(T) such that t^T H t < 0. Hence, if Q̂ is to be a minimizer, Q̂_{m,1} must equal ±1. ✷

Lemma 6.3 Assume that

$$\hat Q = \begin{bmatrix} 0 & 0 \\ 0 & P \\ \tilde K & 0 \end{bmatrix},$$

where K̃ ∈ R^{r×r} is anti-diagonal with elements ±1 and P ∈ R^{(n−r)×(n−r)}. If P_{n−r,1} ≠ ±1 then Q̂ is not a minimizer.

Proof. Let q_{r+1} = ±e_i with (m − n) ≤ i < (m − r), and q_j = ±e_{m−r} with r + 1 < j ≤ n. Choose the tangent direction with C = 0, but now take S_{j,r+1} = s (and S_{r+1,j} = −s, due to skew-symmetry) and zeros elsewhere. We then get T_{m−r,r+1} = ±s and T_{i,j} = ±(−s), with all other elements equal to zero. Denoting by T_{r+1} and T_j the columns containing these two elements, we get

t^T H t = T_{r+1}^T(χ_{r+1}² D + λ_{r+1,r+1} I)T_{r+1} + T_j^T(χ_j² D + λ_{j,j} I)T_j = s²(χ_{r+1}² D_{m−r} + λ_{r+1,r+1}) + s²(χ_j² D_i + λ_{j,j}).    (25)

The Lagrange parameters are λ_{r+1,r+1} = −χ_{r+1}² D_i and λ_{j,j} = −χ_j² D_{m−r}; substituting this into (25), we get

t^T H t = s²(χ_{r+1}² D_{m−r} − χ_{r+1}² D_i + χ_j² D_i − χ_j² D_{m−r}) = s²(χ_{r+1}² − χ_j²)(D_{m−r} − D_i) < 0,

by (16) and (15), since r + 1 < j and i < (m − r), respectively. ✷

Proof of Theorem 6.2. By Lemmas 6.1, 6.2 and 6.3, take

$$\hat Q = \begin{bmatrix} 0 & 0 \\ 0 & \tilde P \\ \tilde K & 0 \end{bmatrix}, \quad \text{where} \quad \tilde K := \begin{bmatrix} 0 & \pm 1 \\ \tilde K & 0 \end{bmatrix} \in \mathbb{R}^{(r+1)\times(r+1)}, \quad \tilde P \in \mathbb{R}^{(n-r-1)\times(n-r-1)},$$


and induction follows trivially; i.e., a minimizer must be of the form stated in Theorem 6.2.

The only thing left to prove is that a matrix of the form

$$\hat Q = \begin{bmatrix} Z \\ K \end{bmatrix}$$

is a minimizer. The tangent direction at Q̂ is

$$T = \hat Q S + (I - \hat Q \hat Q^T) C = \hat Q S + \begin{bmatrix} I_{m-n} & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} C_1 \\ C_2 \end{bmatrix} = \begin{bmatrix} C_1 \\ KS \end{bmatrix}.$$

Observe that C ≠ 0 only contributes positive terms to the condition t^T H t. Hence we can choose C = 0 for simplicity, to get

$$T = \begin{bmatrix} 0 \\ \tilde S \end{bmatrix}.$$

The matrix S̃ = KS has the "permuted and possibly negated" appearance

$$\tilde S = \begin{bmatrix} \pm s_{1,n} & \pm s_{2,n} & \ldots & \pm s_{n-1,n} & 0 \\ \pm s_{1,n-1} & \ldots & \pm s_{n-2,n-1} & 0 & \pm s_{n,n-1} \\ \vdots & \ldots & 0 & \ldots & \vdots \\ \pm s_{1,2} & 0 & \pm s_{3,2} & \ldots & \pm s_{n,2} \\ 0 & \pm s_{2,1} & \ldots & \pm s_{n-1,1} & \pm s_{n,1} \end{bmatrix}.$$

However, as we shall see, it is the absolute values of these elements that matter. The necessary condition is

$$t^T H t = \sum_{i=1}^{n} \sum_{j=1, j\neq i}^{n} (\chi_i^2 D_{m-j+1} + \lambda_{i,i})\, s_{i,j}^2 = \sum_{i=1}^{n} \sum_{j=1, j\neq i}^{n} (\chi_i^2 D_{m-j+1} - \chi_i^2 D_{m-i+1})\, s_{i,j}^2. \qquad (26)$$

Since s_{i,j}² = s_{j,i}², we can collect these terms and write (26) as

$$t^T H t = \sum_{i=1}^{n} \sum_{j>i} s_{i,j}^2 (\chi_i^2 D_{m-j+1} - \chi_i^2 D_{m-i+1} + \chi_j^2 D_{m-i+1} - \chi_j^2 D_{m-j+1}) = \sum_{i=1}^{n} \sum_{j>i} s_{i,j}^2 (\chi_i^2 - \chi_j^2)(D_{m-j+1} - D_{m-i+1}) \geq 0, \qquad (27)$$

since j > i ⇒ (χ_i² − χ_j²) > 0 and (D_{m−j+1} − D_{m−i+1}) > 0, by (16) and (15). Equality t^T H t = 0 occurs only if t = 0, so the last condition (27) is in fact the sufficient condition for a minimizer, i.e., t^T H t > 0 for all t = vec(T), T ∈ T, t ≠ 0.

Finally, it is easily seen that each minimum Q̂_i results in the same objective function value,

$$\|A \hat Q_i X\|_F^2 = \|F\,\mathrm{vec}(\hat Q_i)\|_2^2 = \sum_{i=1}^{n} \chi_i^2 D_{m-i+1} (\pm 1)^2 = \sum_{i=1}^{n} \chi_i^2 \alpha_{m-i+1}^2. \qquad ✷$$

This count and the common minimum value are verified numerically in the sketch below.
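The following sketch enumerates the candidate minima [Z; K] of Theorem 6.2 for a small diagonal example and checks that no randomly drawn feasible point does better. It is an illustration, not a proof; the example data are ours.

```python
import numpy as np
from itertools import product

m, n = 4, 2
alpha = np.array([4.0, 3.0, 2.0, 1.0])     # strictly decreasing
chi = np.array([2.0, 1.0])                 # strictly decreasing
A, X = np.diag(alpha), np.diag(chi)

# the 2^n candidates [Z; K], K anti-diagonal with +-1 elements
K0 = np.fliplr(np.eye(n))
value = sum(chi[i]**2 * alpha[m - i - 1]**2 for i in range(n))
for signs in product([1.0, -1.0], repeat=n):
    Q = np.vstack([np.zeros((m - n, n)), K0 * np.array(signs)])
    assert np.isclose(np.linalg.norm(A @ Q @ X, 'fro')**2, value)

# random feasible points never attain a smaller objective value
rng = np.random.default_rng(1)
for _ in range(1000):
    Q, _ = np.linalg.qr(rng.standard_normal((m, n)))
    assert np.linalg.norm(A @ Q @ X, 'fro')**2 >= value - 1e-12
```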


In a similar way, all maxima of (14) can be shown to be of the form

$$Q = \begin{bmatrix} \mathrm{diag}(\pm 1, \ldots, \pm 1) \\ Z \end{bmatrix} \in \mathbb{R}^{m \times n}.$$

The remaining critical points, which are neither minima nor maxima, are then saddle points.

6.3 Some cases with equal singular values

If two or more singular values are equal, e.g., α_i = α_{i+1} and/or χ_i = χ_{i+1}, the optimization problem with B = 0 can have connected minima. This is easily understood by looking at the 2 by 1 case with A = I_2 and X = 1. Then the optimization problem min ||AQX||_F^2, subject to Q^T Q = 1, consists of finding the shortest distance from the unit circle to the origin. Obviously, this problem has an infinite number of solutions, since the distance from the unit circle to the origin (the radius) is constant.

The orthogonal Procrustes problem is a case with equal singular values. Any Q ∈ R^{m×n} yields the same objective function value if B = 0, because of the circular properties of F.

If Q ∈ R^{m×1} and α_i = α_{i+1} = ... = α_m, we get a subspace minimizing the problem, according to

$$\hat Q = \begin{bmatrix} z \\ q \end{bmatrix},$$

where z ∈ R^{i−1} is a zero vector and q ∈ R^{m−i+1} fulfills q^T q = 1. The same happens for unbalanced problems of general dimensions: e.g., take X = I_n; then if Q̂ is a minimizer of min (1/2)||AQ||_F^2, so is any Q = Q̂V, where V ∈ R^{n×n} is an orthogonal matrix, since

||AQ||_F^2 = trace(Q^T A^T A Q) = trace(Q Q^T A^T A) = trace(Q̂ V V^T Q̂^T A^T A) = ||AQ̂||_F^2.    (28)

7 Discussion of the general case B ≠ 0

For the OPP it is known that if B is rank deficient, the solution is not unique. In this section, we consider some special cases with B ≠ 0 that result in several minima for different setups of (1). As mentioned earlier, analysis with an arbitrary B is beyond the scope of this paper.


For a problem of the form (14), define the function g(s, b) = ||F vec(Q) − b||_2^2, where s ∈ R^p is a parametrization of Q and b ∈ R^{mn}. The gradient of g(s, b) with respect to s is ∇_s g(s, b) ∈ R^p, and at an extreme point ∇_s g(s, b) = 0 is fulfilled. By the implicit function theorem, if det(∇_s² g(s, b)) ≠ 0 there exists a neighborhood W of b and a unique continuous function h : W ↦ R^p such that ∇_s g(h(u), u) = 0 for all u ∈ W. Let W_i, i = 1, ..., 2^n, be such neighborhoods for the minima given when b = 0; then the optimization problem has at least 2^n minima for all b ∈ ∩_{i=1}^{2^n} W_i.

7.1 Some examples

In the ellipsoid cases there is a number γ such that if ||B||_F > γ, the problem has a unique minimizer. The same does not hold for general cases with n > 1. Similar to the case when an OPP lacks a unique minimizer, it is always possible to choose a rank-deficient B at infinity such that (1) has more than one minimum. Let Q ∈ R^{3×2} and take

$$B = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ \beta & 0 \end{bmatrix},$$

where β > 0 is arbitrarily large. Then the optimization problem has the two minimizers

$$\begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}, \qquad \begin{bmatrix} 0 & 0 \\ 0 & -1 \\ 1 & 0 \end{bmatrix}.$$

This is easily generalized to problems of general dimensions, and we can draw the conclusion that no bounded, finite "evolute surface" exists as in the ellipsoid cases.

For an unbalanced problem with X = I_n, (28) indicates circular properties of F close to the origin. One might think that for a small perturbation B = δB there should be fewer than 2^n minima. This is not necessarily the case: with Q ∈ R^{3×2} (and X = I_2), take

$$B = \begin{bmatrix} \beta & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}.$$

For a sufficiently small β > 0 there are still 2^n = 4 minimizers, of the form

$$\hat Q = \begin{bmatrix} \cos\hat\varphi & 0 \\ \pm\sin\hat\varphi & 0 \\ 0 & \pm 1 \end{bmatrix},$$

where φ̂ is the solution of

$$\min_\varphi \Big\| \begin{bmatrix} \alpha_1 & 0 \\ 0 & \alpha_2 \end{bmatrix} \begin{bmatrix} \cos\varphi \\ \sin\varphi \end{bmatrix} - \begin{bmatrix} \beta \\ 0 \end{bmatrix} \Big\|_2^2.$$


Let us assume that the maximal number of unconnected minima of (1) is 2^n. For a given B = B̃, one could then assume that as B → 0 the minimizers follow continuously. An idea would be to try the opposite, i.e., an algorithm that starts out at B = 0 and approaches the given value B = B̃.

7.2 A simple algorithm

Consider the following optimization problem,

min_Q ||AQX − βB||_F^2 , subject to Q^T Q = I_n,

where β is a parameter ranging from 0 to 1. At β = 0 there are several minimizers, but it is possible to derive which of these minima gives the least objective function value ||AQX − B||_F^2. Let Q_0 be this minimum. If α_i > α_{i+1} for i = 1, ..., m and X = I_n, Q_0 is of the form

$$Q_0 = \begin{bmatrix} Z \\ \tilde Q \end{bmatrix},$$

where Q̃ ∈ R^{n×n} is orthogonal and Z ∈ R^{(m−n)×n} is a zero matrix. The objective function is then

||AQ_0 − B||_F^2 = trace(Q_0^T A^T A Q_0 − 2 Q_0^T A^T B + B^T B).

Because of the special structure of Q_0, the first term is constant over all Q_0 of this form, so min ||AQ_0 − B||_F^2 = max trace(Q_0^T A^T B). Perform an SVD of A^T B as

$$U \Sigma V^T = \begin{bmatrix} U_1 \\ U_2 \end{bmatrix} \Sigma V^T = A^T B,$$

where U_1 ∈ R^{(m−n)×m} and U_2 ∈ R^{n×m}. Then

trace(Q_0^T A^T B) = trace(V^T [Z^T, Q̃^T] [U_1; U_2] Σ) = trace(V^T Q̃^T U_2 Σ).

Since V^T Q̃^T is an n by n orthogonal matrix, we can derive the optimal solution, analogous to the procedure for an orthogonal Procrustes problem, by using the SVD of U_2 Σ. Let Ũ Σ̃ Ṽ^T = U_2 Σ; then trace(V^T Q̃^T Ũ Σ̃ Ṽ^T) is maximized if Ṽ^T V^T Q̃^T Ũ = I_n, i.e., if Q̃ = Ũ Ṽ^T V^T. Here X = I_n was used, that is, χ_i = χ_{i+1} = 1, but the same procedure can be applied to cases with χ_i ≥ χ_{i+1}.

Consider now an algorithm as follows; a runnable sketch is given after the listing.

1. Compute Q_0.
2. k = 0.
3. for β > 0 to β = 1
   3.1. k = k + 1.
   3.2. Let Q_k be the solution of
        min ||AQX − βB||_F^2,    (29)
        using Q_{k−1} as the initial value for the iterative method used to solve (29).
4. end for.
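A minimal sketch of this continuation strategy, assuming some local WOPP solver is available (here a hypothetical callable solve_wopp, e.g. the Newton-type method of Paper II); the number of β-steps is an arbitrary choice:

```python
import numpy as np

def continuation(A, X, B, Q0, solve_wopp, n_steps=50):
    """beta-continuation of Section 7.2 (a sketch).

    solve_wopp(A, X, B, Q0) -- assumed local solver returning a
    minimizer of ||A Q X - B||_F^2 started from Q0.
    """
    Q = Q0
    for beta in np.linspace(0.0, 1.0, n_steps + 1)[1:]:
        # warm start: solve the slightly perturbed problem from the
        # previous minimizer, tracing a trajectory of minima
        Q = solve_wopp(A, X, beta * B, Q)
    return Q
```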


Does Q_k converge to the global minimum as β → 1? Empirical studies have shown that this is not always the case. For an optimal Q_0 (computed as above), Q_k can at some point β = β̃ become a local minimum, even though the trajectory of Q_k's followed is (as far as could be observed) continuous. Studies have also shown that starting with a non-optimal Q_0 can yield a continuous trajectory converging towards the global minimum as β → 1. Non-optimal here means that Q_0 is a minimum of min ||AQX||_F^2, but not optimal in the sense of min ||AQ_0 X − B||_F^2 as described above. That is, a local minimum can become a global minimizer at some point along the trajectory. It is not clear why this can happen.

8 Concluding remarks

Studying the different cases when B ≈ 0 gives insight into the number of minima a WOPP may have and how they are located in relation to each other. Extensive empirical studies, some presented in [20], have shown no more than 2^n minima for a WOPP, and it seems reasonable to conjecture that this holds in general.

Not mentioned here are some continuation (homotopy) methods that were considered. As with the algorithm described in Section 7.2, they too failed in some cases. However, in connection with this work, a successful algorithm has been developed to compute all minimizers [20]. Still, much remains to be understood about how the geometric properties of these problems can be used to achieve global minimization for a general B.

Appendix

A The canonical form of a WOPP

Proposition A.1 The matrices A ∈ R^{m_A×m} and X ∈ R^{n×n_X} with rank(A) = m and rank(X) = n belonging to a WOPP

min (1/2)||AQX − B||_F^2 , subject to Q^T Q = I_n,

can always be considered as m by m and n by n diagonal matrices, respectively.


Proof. Let A = U_A Σ_A V_A^T and X = U_X Σ_X V_X^T be the singular value decompositions of A and X. Then

$$\|U_A \Sigma_A V_A^T Q U_X \Sigma_X V_X^T - B\|_F^2 = \|U_A \Sigma_A Z \Sigma_X V_X^T - B\|_F^2,$$

where Z = V_A^T Q U_X ∈ R^{m×n} has orthonormal columns. Since U_A^T U_A = I_{m_A} and V_X^T V_X = I_{n_X}, it follows that

$$\|U_A \Sigma_A Z \Sigma_X V_X^T - B\|_F^2 = \mathrm{tr}\big((U_A \Sigma_A Z \Sigma_X V_X^T - B)^T (U_A \Sigma_A Z \Sigma_X V_X^T - B)\big)$$
$$= \mathrm{tr}\big(V_X \Sigma_X Z^T \Sigma_A^2 Z \Sigma_X V_X^T - 2 V_X \Sigma_X Z^T \Sigma_A U_A^T B + B^T B\big)$$
$$= \mathrm{tr}\big(\Sigma_X Z^T \Sigma_A^2 Z \Sigma_X\big) - 2\,\mathrm{tr}\big(\Sigma_X Z^T \Sigma_A U_A^T B V_X\big) + \mathrm{tr}\big(B^T B\big)$$
$$= \mathrm{tr}\big((\Sigma_A Z \Sigma_X - U_A^T B V_X)^T (\Sigma_A Z \Sigma_X - U_A^T B V_X)\big) = \|\Sigma_A Z \Sigma_X - U_A^T B V_X\|_F^2.$$

Hence, without loss of generality, we can assume that A = diag(α_1, ..., α_m) and X = diag(χ_1, ..., χ_n) with α_i ≥ α_{i+1} ≥ 0 and χ_i ≥ χ_{i+1} ≥ 0. ✷

B The solution to an OPP

Theorem B.1 Let X ∈ R^{n×n} and B ∈ R^{m×n} be known matrices with rank(X) = n and rank(B) = n. Then the solution Q̂ of the orthogonal Procrustes problem

min (1/2)||QX − B||_F^2 , subject to Q^T Q = I_n,    (30)

is Q̂ = V I_{m,n} U^T, where U and V are the orthogonal matrices given by the singular value decomposition UΣV^T = XB^T.

Proof. Since

||QX − B||_F^2 = trace((QX − B)^T(QX − B)) = trace((QX)^T(QX)) + trace(B^T B) − trace((QX)^T B) − trace(B^T(QX)) = ||X||_F^2 + ||B||_F^2 − 2 trace(B^T QX),

equation (30) is equivalent to

max trace(B^T QX) , subject to Q^T Q = I_n.    (31)

Note that trace(B^T QX) = trace(XB^T Q), and let UΣV^T = XB^T be a singular value decomposition. Using the matrix Z = V^T Q U, Z ∈ R^{m×n}, we get

$$\mathrm{trace}(XB^T Q) = \mathrm{trace}(\Sigma V^T Q U) = \mathrm{trace}(\Sigma Z) = \sum_{i=1}^{n} \sigma_i z_{i,i}.$$

Since Z has orthonormal columns, the upper bound of (31) is attained for z_{i,i} = 1, i.e., Z = I_{m,n}. The solution of (30) is then V^T Q U = I_{m,n} ⇒ Q = V I_{m,n} U^T. ✷
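The closed-form OPP solution of Theorem B.1 is easy to exercise numerically; a minimal NumPy sketch with a random-sampling sanity check (ours, for illustration only):

```python
import numpy as np

def solve_opp(X, B):
    """Q = V I_{m,n} U^T with U S V^T = X B^T (Theorem B.1)."""
    m, n = B.shape
    U, _, Vt = np.linalg.svd(X @ B.T)    # U: n-by-n, Vt: m-by-m
    return Vt.T @ np.eye(m, n) @ U.T

rng = np.random.default_rng(2)
m, n = 5, 3
X = rng.standard_normal((n, n))
B = rng.standard_normal((m, n))
Q = solve_opp(X, B)
assert np.allclose(Q.T @ Q, np.eye(n))   # feasibility on V_{m,n}
best = np.linalg.norm(Q @ X - B, 'fro')
for _ in range(200):
    Qr, _ = np.linalg.qr(rng.standard_normal((m, n)))
    assert np.linalg.norm(Qr @ X - B, 'fro') >= best - 1e-9
```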


If we consider the balanced case of a WOPP with X = I_n,

min (1/2)||AQ − B||_F^2 , subject to Q^T Q = I_n,    (32)

then (32) is an OPP, since Q is orthogonal [11].

C Parametrization of V_{m,n} by using the Cayley transform

The Cayley transform is often used to represent orthogonal matrices with positive determinant as

Q(S) = (I + S)(I − S)^{-1},    (33)

where S ∈ R^{m×m} is skew-symmetric (S = −S^T). Since S has purely imaginary eigenvalues, (I − S) always has full rank. This parametrization fails in some cases, namely when (Q̃ + I) is singular. As an example, there exists no S ∈ R^{2×2} such that Q(S) = diag(−1, −1). Instead of using (33) as a parametrization of orthogonal matrices, a local parametrization can be used. Given a point Q̃ ∈ V_{m,m}, we can express any Q ∈ V_{m,m} in the vicinity of Q̃ by using

Q(S) = Q̃(I + S)(I − S)^{-1}.    (34)

To get a local parametrization of V_{m,n} when n ≤ m, (34) is modified as follows. Given a point Q̃ ∈ V_{m,n}, a parametrization of any Q ∈ V_{m,n} in the vicinity of Q̃ can be written as

Q(S) = [Q̃, Q̃⊥](I + S)(I − S)^{-1} I_{m,n}.    (35)

Here Q̃⊥ is any extension such that [Q̃, Q̃⊥] ∈ R^{m×m} is orthogonal and

$$I_{m,n} = \begin{bmatrix} I_n \\ 0 \end{bmatrix} \in \mathbb{R}^{m \times n}.$$

S is skew-symmetric, structured as

$$S = \begin{bmatrix} S_{11} & -S_{21}^T \\ S_{21} & 0 \end{bmatrix}, \qquad (36)$$

where S_{11} ∈ R^{n×n} is skew-symmetric, S_{21} ∈ R^{(m−n)×n} is arbitrary, and the remaining lower right block is a zero matrix. Observe that if m = n, then (35) is the same as (34).

C.1 The tangent space of V_{m,n}

Definition C.1 The tangent space of the Stiefel manifold V_{m,n} at a given point Q̃ can be expressed as

T = {T = Q̃S + (I − Q̃Q̃^T)C},    (37)

where S ∈ R^{n×n} is skew-symmetric and C ∈ R^{m×n} is arbitrary.


By using the power expansion

(I − S)^{-1} = I + S + S^2 + S^3 + ...,

(35) can be expressed as

Q(S) = [Q̃, Q̃_⊥](I + S)(I + S + S^2 + ...) I_{m,n}.

The first order linear approximation in S is then

Q(S) ≈ [Q̃, Q̃_⊥](I + 2S) I_{m,n} = Q̃ + 2[Q̃, Q̃_⊥] S I_{m,n}.

The second term, which depends on S, gives a representation of the tangent space as

2[Q̃, Q̃_⊥] S I_{m,n} = 2Q̃S_{11} + 2Q̃_⊥ S_{21}.   (38)

With 2S_{11} = S, and since Range(Q̃_⊥) = Range(I − Q̃Q̃^T), (38) is the same as (37). Note that (38) is independent of the zero elements in the lower right part of S, due to the multiplication with I_{m,n}; this is why S can be taken on the form given in (36).

D Number of minima to the ellipsoid cases

Theorem D.1 If A ∈ R^{m×m} has distinct singular values α_i > α_{i+1} > 0, i = 1, ..., m−1, then (10) has a maximum of two minimizers.

In the proof that follows, it is shown that if a point q̃ ∈ R^m fulfills sign(q̃_i) ≠ sign(b_i) for any i = 1, ..., m−1, then q̃ is not a minimizer. This leaves only two possible minimizers: one with sign(q_m) = sign(b_m) and one with sign(q_m) = −sign(b_m). As a reminder, we also assume that A is diagonal (on the canonical form). First we make an assumption that is made clear during the proof.

Assumption D.1 With the conditions stated in Theorem D.1, let q̂ be a minimizer of (10). Then there does not exist any other minimum q̄ such that sign(q̄_i) = sign(q̂_i) for all i = 1, ..., m. That is, any other minimum q̄ of (10) must have at least one element q̄_i with a different sign than q̂_i.

Proof. First assume that b_i ≠ 0 for all i = 1, ..., m. The case when some b_i = 0 is considered at the end of the proof.

Now, let q̃ be a point with sign(q̃_k) ≠ sign(b_k), where 1 ≤ k ≤ m−1. Let U(φ) ∈ R^{m×m} be a plane rotation in the plane spanned by [e_k, e_m]:

U(φ) = [ I_{k−1}  0  0  0 ;  0  cos φ  0  −sin φ ;  0  0  I_{m−k−1}  0 ;  0  sin φ  0  cos φ ].   (39)

Observe that U(0) = I_m and that

min_φ ||AU(φ)q̃ − b||_2^2
is equivalent to

min_φ || [ α_k  0 ; 0  α_m ] [ cos φ  −sin φ ; sin φ  cos φ ] [ q̃_k ; q̃_m ] − [ b_k ; b_m ] ||_2^2 = min_φ ||ÂÛ(φ)z − b̃||_2^2,   (40)

where z = [q̃_k, q̃_m]^T. If q̃ is a minimizer, then any arbitrarily small δφ ≠ 0 must yield ||ÂÛ(δφ)z − b̃||_2 > ||Âz − b̃||_2. Equation (40) is just the ellipse problem described earlier, i.e., take

Ã = ||z||_2 Â,  z̃(φ) = Û(φ) z/||z||_2;

then ||z̃(φ)||_2 = 1 and (40) is the same as

min_φ ||Ãz̃(φ) − b̃||_2^2.   (41)

Assume for simplicity that b̃ is in the first quadrant of R^2, according to Figure 2. Since sign(q̃_k) ≠ sign(b_k), z̃(0) is either in the second or the third quadrant, depending on the sign of q̃_m. However, no matter the sign of q̃_m, since α_k > α_m there exists an arbitrarily small |δφ| > 0 such that

||Ãz̃(δφ) − b̃||_2 < ||Ãz̃(0) − b̃||_2 ⇒ ||AU(δφ)q̃ − b||_2 < ||Aq̃ − b||_2,   (42)

hence q̃ cannot be a minimizer.

Figure 2: The ellipse determined by Ã, with semi-major axis α_k and semi-minor axis α_m. At φ = 0 the residual r = Ãz̃(0) − b̃ is shown. The dotted circle with radius ||r||_2 is centered at b̃, and the direction δφ implies that z̃(0) is not a minimizer.

Two scenarios when (42) does not hold are:
1). If z̃(0) is a global minimum of (41) in the first quadrant ⇒ sign(q̃_k) = sign(b_k) and sign(q̃_m) = sign(b_m).

2). If z̃(0) is a local minimum of (41) in the fourth quadrant ⇒ sign(q̃_k) = sign(b_k) and sign(q̃_m) = −sign(b_m).

We have shown that any minimizer q̂ of (10) must fulfill sign(q̂_i) = sign(b_i) for all i = 1, ..., m−1. By connecting two points with an ellipse, it should now be clear that Assumption D.1 is valid. Assume the opposite, that q̄ is also a minimizer and that sign(q̄_i) = sign(q̂_i) is fulfilled for all i = 1, ..., m. Connecting Aq̂ and Aq̄ (and back to Aq̂ again) with an ellipse would then yield a condition on the form (42), so only one of them can be a minimizer. From this we can conclude that if b_i ≠ 0 for all i = 1, ..., m, the global minimizer of (10) fulfills 1), whereas the second, local, minimizer (if any) fulfills 2). Hence a maximum of two minimizers can occur.

Now assume b has p zero elements b_k = 0, where k ∈ {1, ..., m−1}. By using plane rotations as above in (39), with b̃ = [0, b_m]^T, it is shown that a minimizer q̂ must have q̂_k = 0: if q̂_k ≠ 0, then there exists a δφ such that (42) holds, no matter what b_m is. Hence we can remove all p zero elements in b and the corresponding p equations in Aq, yielding an optimization problem with q and b in R^{m−p}. If now b_{m−p} ≠ 0, then we have exactly the case described first, with b_i ≠ 0 for all i = 1, ..., m−p. That is, (10) can have at most two minimizers.

Lastly, assume that b_m = 0. Then, by using plane rotations U(φ) as earlier but in the plane spanned by [e_k, e_{m−1}] and with b̃ = [b_k, b_{m−1}]^T, the conclusion is that the elements of a minimizer q̂ must fulfill sign(q̂_i) = sign(b_i) for all i = 1, ..., m−2. The element q̂_{m−1}, however, could so far have either sign. But now take a plane rotation in the plane spanned by [e_{m−1}, e_m] with b̃ = [b_{m−1}, 0]^T, and we see that sign(q̂_{m−1}) = sign(b_{m−1}) must hold. Then q̂_m is given by

q̂_m = ±√(1 − q̂_1^2 − q̂_2^2 − ... − q̂_{m−1}^2),   (43)

due to the constraint q^T q = 1. If the expression under the root in (43) equals 0 there is only one minimizer of (10); otherwise there are two. □

References

[1] M. D. Akca. Generalized Procrustes Analysis and its Applications in Photogrammetry. ETH, Swiss Federal Institute of Technology Zurich, Institute of Geodesy and Photogrammetry, 2003. Prepared for: Praktikum in Photogrammetrie, Fernerkundung und GIS.

[2] J. Balog, T. Csendes, and T. Rapcsák. Some global optimization problems on Stiefel manifolds. J. Global Optimization, 30(1):91–101, 2004.

[3] M. T. Chu and N. T. Trendafilov. On a Differential Equation Approach to the Weighted Orthogonal Procrustes Problem. Statistics and Computing, 8(2):125–133, 1998.


[4] M. T. Chu and N. T. Trendafilov. The Orthogonally Constrained Regression Revisited. J. Comput. Graph. Stat., 10:746–771, 2001.

[5] T. F. Cox and M. A. A. Cox. Multidimensional Scaling. Chapman & Hall, 1994.

[6] A. Edelman, T. A. Arias, and S. T. Smith. The Geometry of Algorithms with Orthogonality Constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.

[7] L. Eldén. Solving Quadratically Constrained Least Squares Problems Using a Differential-Geometric Approach. BIT Numerical Mathematics, 42(2), 2002.

[8] L. Eldén and H. Park. A Procrustes problem on the Stiefel manifold. Numer. Math., 82(4):599–619, 1999.

[9] G. E. Forsythe and G. H. Golub. On the Stationary Values of a Second-degree Polynomial on the Unit Sphere. J. Soc. Indust. Appl. Math., 13(4), 1965.

[10] W. Gander. Least Squares with a Quadratic Constraint. Numer. Math., 36:291–307, 1981.

[11] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1989.

[12] J. C. Gower. Multivariate Analysis: Ordination, Multidimensional Scaling and Allied Topics. Handbook of Applicable Mathematics, VI: Statistics (B), 1984.

[13] J. C. Gower and G. B. Dijksterhuis. Procrustes Problems. Oxford University Press, 2004.

[14] M. A. Koschat and D. F. Swayne. A Weighted Procrustes Criterion. Psychometrika, 56(2):229–239, 1991.

[15] A. Mooijaart and J. J. F. Commandeur. A General Solution of the Weighted Orthonormal Procrustes Problem. Psychometrika, 55(4):657–663, 1990.

[16] T. Rapcsák. On Minimization on Stiefel Manifolds. European J. Oper. Res., 143(2):365–376, 2002.

[17] I. Söderkvist. Some Numerical Methods for Kinematical Analysis. ISSN-0348-0542, UMINF-186.90, Department of Computing Science, Umeå University, 1990.

[18] I. Söderkvist and P.-Å. Wedin. On Condition Numbers and Algorithms for Determining a Rigid Body Movement. BIT, 34:424–436, 1994.


[19] E. Stiefel. Richtungsfelder und Fernparallelismus in n-dimensionalen Mannigfaltigkeiten. Commentarii Math. Helvetici, 8:305–353, 1935–1936.

[20] T. Viklands. On Global Minimization of Weighted Orthogonal Procrustes Problems. Technical Report UMINF-06.09, Department of Computing Science, Umeå University, Umeå, Sweden, 2006.

[21] T. Viklands and P. Å. Wedin. Algorithms for Linear Least Squares Problems on the Stiefel manifold. Technical Report UMINF-06.07, Department of Computing Science, Umeå University, Umeå, Sweden, 2006.

[22] P. Å. Wedin and T. Viklands. Algorithms for 3-dimensional Weighted Orthogonal Procrustes Problems. Technical Report UMINF-06.06, Department of Computing Science, Umeå University, Umeå, Sweden, 2006.


Paper IV

On Global Minimization of Weighted Orthogonal Procrustes Problems*

Thomas Viklands†
Department of Computing Science, Umeå University
SE-901 87 Umeå, Sweden.
viklands@cs.umu.se

Abstract

A weighted orthogonal Procrustes problem (WOPP) can be written as min_Q (1/2)||AQX − B||_F^2, subject to Q^T Q = I_n. Here Q ∈ R^{m×n}, with n ≤ m, has orthonormal columns, and A, X and B are known matrices. Problems of this kind can have more than one minimum. The maximal number of minima seems to comply with the formula 2^n, i.e., it depends only on the number of columns in Q. This paper proposes an algorithm for computing all, or nearly all, minima of a WOPP. The algorithm uses the normal plane of AQX at a computed minimum Q = Q̂ to calculate a set of 2^n matrices with orthonormal columns. Remarkably, each of these matrices lies in the vicinity of another minimum (if any), due to the special geometry of the surface of AQX.

Keywords: Matrices, weighted, orthogonal, Procrustes, ellipses, normal plane, global minimum, Stiefel manifold, global minimization, optimization, algorithms.

* From UMINF-06.09, 2006. Submitted to BIT.
† Financial support has partly been provided by the Swedish Foundation for Strategic Research under the frame program grant A3 02:128.


Contents

1 Introduction 111
2 Geometry in the case when Q ∈ R^{m×1}, m > 1 113
3 Higher dimensional cases 115
4 The Riccati normals 116
  4.1 Computing all symmetric solutions 117
5 Computational experiments 120
  5.1 Generating test problems 120
  5.2 Tables 121
6 Conclusions 122
A The canonical form of a WOPP 123
B Normal plane intersections for Q ∈ R^{2×2} 124
C Normal plane intersections for a simple OPP 124
D CARE: Multiple eigenvalues 125
E Tables for the computational experiments 127
  E.1 Problems of dimension n < 7 127
  E.2 Higher dimensional problems 135
  E.3 Results when using ε = 1 135
References 139


1 Introduction

A weighted orthogonal Procrustes problem (WOPP) can be formulated as

min_Q ||AQX − B||_F, subject to Q^T Q = I_n,   (1)

where A ∈ R^{m×m}, X ∈ R^{n×n} and B ∈ R^{m×n} are known matrices with rank(A) = m, rank(X) = n and n ≤ m. Equation (1) is an optimization problem to be solved on the Stiefel manifold [16]

V_{m,n} = {Q ∈ R^{m×n} : Q^T Q = I_n, n ≤ m}.

Additionally, we can assume that A and X are diagonal matrices, A = diag(α_1, ..., α_m) and X = diag(χ_1, ..., χ_n), with α_i ≥ α_{i+1} > 0 and χ_i ≥ χ_{i+1} > 0, respectively; see Appendix A. We call this the canonical form of a WOPP.

For some special cases, e.g., if X = I_n and m = n, (1) specializes to the orthogonal Procrustes problem (OPP),

min_Q (1/2)||AQ − B||_F^2, subject to Q^T Q = I_m.   (2)

This problem has an analytic solution that can be derived from the singular value decomposition of B^T A [7].

Generally, a solution to (1) cannot be computed explicitly as in the OPP case; an iterative method is needed. Earlier work on iterative algorithms for solving problems similar to (1) is reported in [1–6, 8, 10, 13–15, 20]. Moreover, a WOPP can have several minima, as also observed by others [1, 2, 5, 6, 8, 10].

This paper presents an algorithm that uses some geometrical properties of the surface of AQX (the normal plane) to compute all minima of (1). To explain how the algorithm works, some definitions connected to the geometry of the surface of AQX are needed.

Definition 1.1 Let

F = {Y = AQX | Q ∈ V_{m,n}}

denote the surface of AQX.

F is also a differentiable manifold, similar to V_{m,n}. For a geometrical understanding it is preferable to have the surface F embedded in R^{mn}. This is achieved by using the traditional vec-operator: vec(AQX) = F vec(Q), where F = X^T ⊗ A and ⊗ is the Kronecker product. Hence F ∈ R^{mn×mn} is also a diagonal matrix. With b = vec(B), (1) is equivalent to

min_Q (1/2)||F vec(Q) − b||_2^2, subject to Q^T Q = I_n.   (3)

This formulation is used in later sections, when looking at some special cases, to motivate the ability of the algorithm to compute additional minima.
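The embedding can be checked directly in a few lines of NumPy. The snippet below is our own sketch, using the column-stacking vec convention, and verifies vec(AQX) = (X^T ⊗ A) vec(Q) on a random canonical-form instance.

    import numpy as np

    rng = np.random.default_rng(1)
    m, n = 4, 2
    A = np.diag(rng.uniform(1.0, 10.0, m))            # canonical-form A
    X = np.diag(rng.uniform(1.0, 10.0, n))            # canonical-form X
    Q = np.linalg.qr(rng.standard_normal((m, n)))[0]  # a point on V_{m,n}

    F = np.kron(X.T, A)                               # mn-by-mn, diagonal here
    vec = lambda M: M.flatten(order='F')              # column-stacking vec
    assert np.allclose(F @ vec(Q), vec(A @ Q @ X))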


For algebraic manipulations and analysis it may be preferable to work in R^{m×n} instead of R^{mn}; hence, to stay consistent, all definitions are based on R^{m×n}. Either way, they are of course equivalent.

Definition 1.2 The tangent space of F at a given point Q̃ is

T_s = {T = A(Q̃S + (I − Q̃Q̃^T)C)X},   (4)

where C ∈ R^{m×n} is arbitrary and S ∈ R^{n×n} is skew-symmetric.

For a suitable parametrization of F, T_s is the set of all tangent directions of F at a point Q̃. The expression for T_s in (4) is derived directly from the tangent space of V_{m,n}; for details see [18].

The algorithm presented in this paper uses the normal plane of F to compute a set of matrices Q ⊂ V_{m,n}. It is expected that every Q ∈ Q is in the vicinity of some other minimizer (if any) of (1). To define the normal plane we first need to define the normal space of F.

Definition 1.3 The normal space of F at a point Q̃ is

N_s = {N = A^{-1}Q̃GX^{-1} | G ∈ R^{n×n}, G = G^T}.

At a given point Q̃, N_s is the set of all vectors that are orthogonal to the tangent space T_s of F at Q̃. Orthogonal here means with respect to the Euclidean inner product, i.e.,

tr(N^T T) = 0

for all N ∈ N_s and T ∈ T_s at a point Q̃. With N = A^{-1}Q̃GX^{-1} and T = A(Q̃S + (I − Q̃Q̃^T)C)X, and if G is symmetric and S is skew-symmetric, then

tr(N^T T) = tr(X^{-1}G^T Q̃^T A^{-1} A(Q̃S + (I − Q̃Q̃^T)C)X) = tr(G^T Q̃^T(Q̃S + (I − Q̃Q̃^T)C)) = tr(G^T S) = 0.

The normal space itself is not used to any great extent later on. However, its parametrization N = A^{-1}Q̃GX^{-1} is used, and the normal plane is defined via the normal space.

Definition 1.4 The normal plane of F at a point Q̃ is

N_p = {N_p = AQ̃X + N | N ∈ N_s}.

N_p is the normal space at Q̃ translated to the point on F where it is defined, i.e., AQ̃X.

The set Q mentioned above corresponds to the intersections of F and the normal plane N_p defined at a point.

Definition 1.5 Given a normal plane N_p at a point Q̃, let

Q(Q̃) = {Q ∈ V_{m,n} : AQX ∈ N_p}

be the set of intersections of N_p and F.
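The orthogonality tr(N^T T) = 0 of Definitions 1.2 and 1.3 is easy to verify numerically. The following sketch is ours, with randomly drawn S (skew-symmetric), C (arbitrary) and G (symmetric).

    import numpy as np

    rng = np.random.default_rng(3)
    m, n = 5, 3
    A = np.diag(rng.uniform(1.0, 10.0, m))
    X = np.diag(rng.uniform(1.0, 10.0, n))
    Qt = np.linalg.qr(rng.standard_normal((m, n)))[0]   # the point Q~

    S = rng.standard_normal((n, n)); S = S - S.T        # skew-symmetric
    C = rng.standard_normal((m, n))                     # arbitrary
    G = rng.standard_normal((n, n)); G = G + G.T        # symmetric

    T = A @ (Qt @ S + (np.eye(m) - Qt @ Qt.T) @ C) @ X  # tangent, eq. (4)
    N = np.linalg.inv(A) @ Qt @ G @ np.linalg.inv(X)    # normal, Def. 1.3
    assert abs(np.trace(N.T @ T)) < 1e-9                # tr(N^T T) = 0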


In Section 4, we show that there are 2^n points of intersection. For some special setups of (1), a continuum of solutions may arise. These cases are similar to the task of computing the eigenvectors of a matrix whose eigenvalues come with multiplicity, e.g., the identity matrix.

Anyhow, having computed one minimizer Q̂, the algorithm computes the set Q(Q̂). Any Q ∈ Q(Q̂) should be a good starting approximation for a nonlinear solver in order to get convergence towards other minimizers of (1). The algorithm works as follows.

Normal plane algorithm: computes all minima of a WOPP.

1. Compute a solution Q̂ to (1) with some solver, e.g., [19].
2. Compute the set Q(Q̂).
3. for each Q_i ∈ Q(Q̂)
   3.1. Use Q_i as the initial value for the solver to compute a solution Q̂_i to (1).
   3.2. Save the solution Q̂_i.
4. end for.

In the following two sections we consider some special cases of (1) and motivate a study of how well the set Q(Q̂) works for more general, higher dimensional, cases. In Section 4, an algorithm that computes the set Q is presented. Finally, results of computational experiments and empirical observations, such as the maximal number of minimizers of a WOPP, are presented in Sections 5–6.

2 Geometry in the case when Q ∈ R^{m×1}, m > 1

The very simplest case is when Q ∈ R^{2×1}, and we use the parametrization Q = [cos(φ), sin(φ)]^T:

F vec(Q) = [ α_1  0 ; 0  α_2 ] [ cos(φ) ; sin(φ) ].

F is an ellipse in R^2 with semi-major and semi-minor axes α_1 and α_2, respectively. The optimization problem consists of finding the point on this ellipse that is closest to the point b. As seen in Figure 1, if there exists an additional minimum Q̂_2 apart from Q̂, it will be in the vicinity of where the normal at F vec(Q̂) intersects F, at F vec(Q_2). In particular, if b = 0, the intersection occurs exactly at the other minimum, i.e., Q_2 = Q̂_2.

To derive the intersection we can use the equation

FQ_2 = FQ̂ + N ⇒ Q_2 = Q̂ + F^{-1}N,   (5)
where N ∈ R^{2×1} is defined according to Definition 1.3, i.e., N = F^{-1}Q̂G, where G ∈ R^1. Applying the condition Q_2^T Q_2 = I_1 = 1 yields the quadratic equation

Q_2^T Q_2 = 1 + 2Q̂^T F^{-2}Q̂G + Q̂^T F^{-4}Q̂G^2 = 1   (6)

to be solved for G. One solution is G = 0, which is of no interest, whilst the other solution lets us compute Q_2 from (5). Q_2 is now a good starting approximation for a nonlinear solver in order to compute the second minimum Q̂_2.

Figure 1: Two minima are found where the tangent is orthogonal to the distance vector from b to the ellipse F. Here Q̂ is the global minimizer. The normal plane at F vec(Q̂) intersects F at F vec(Q_2), which is in the vicinity of the other minimum F vec(Q̂_2).

Doing this geometric investigation when Q ∈ R^{3×1} yields the same result, and generalizing to the case when Q ∈ R^{m×1} gives the same equation. For these special cases F is a hyperellipsoid in R^{mn}, so the normal plane is a normal vector and hence G is a scalar.

At a minimum Q̂ the residual is r = b − F vec(Q̂), and r ∈ N_s. Evidently, if the magnitude of the residual is large in relation to F and r points inwards, towards the surface F, then F vec(Q_2) gives a smaller objective function value than F vec(Q̂). Hence there must be another minimizer in the vicinity of Q_2. A somewhat similar case is when Q ∈ R^{2×2}; see Appendix B for details.
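For the one-column case the whole construction fits in a few lines. The sketch below uses our own names, stores F as a matrix, solves (6) for the nonzero root G, and steps to the intersection Q_2 via (5).

    import numpy as np

    def normal_intersection_1col(F, q_hat):
        # Solve the quadratic (6): 1 + 2 c G + r G^2 = 1, with
        # c = q^T F^-2 q and r = q^T F^-4 q; return Q2 = q + F^-2 q G.
        F2 = np.linalg.matrix_power(F, -2)
        c = q_hat @ F2 @ q_hat
        r = q_hat @ F2 @ F2 @ q_hat
        G = -2.0 * c / r                   # the root besides G = 0
        return q_hat + F2 @ q_hat * G

    F = np.diag([2.0, 1.0])                        # ellipse, alpha1 = 2, alpha2 = 1
    q_hat = np.array([np.cos(0.3), np.sin(0.3)])   # any point on the unit circle
    q2 = normal_intersection_1col(F, q_hat)
    assert np.isclose(q2 @ q2, 1.0)                # q2 is again a feasible point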


3 Higher dimensional cases

For the case when Q ∈ R^{3×2} the surface F is much harder to depict. However, by choosing a parametrization

Q_±(φ) = [ 0  cos(φ) ; 0  sin(φ) ; ±1  0 ],

and embedding the surface F in R^6, we get the appearance shown in Figure 2. The x, y, z-axes correspond to the ±[0, 0, 1, 0, 0, 0]^T, ±[0, 0, 0, 0, 1, 0]^T and ±[0, 0, 0, 1, 0, 0]^T directions, respectively. The tangent plane has only one component in this space, just as in the case with an ellipse, hence the normal plane is a two-dimensional plane (in this R^3 subspace). Assume that Q̂ is a minimum with b lying in the (x, y)-plane. Solving the set of quadratic equations results in four¹ solutions, giving the directions v_1 = 0, v_2, v_3 and v_4. Here Q_3, and possibly Q_4, yields a smaller residual than Q̂, so there must be at least one minimum with smaller residual in the vicinity of Q_3 or Q_4.

Figure 2: In this x, y, z subspace of R^6, F consists of two ellipses, each lying in a plane parallel to the y, z-plane at equal distance from the origin. Suppose b, for simplicity, lies in the x, y-plane, and let Q̂ be a minimizer. T_p is the tangent plane component in this subspace. The residual component is r = b − F vec(Q̂). v_2, v_3 and v_4 are vectors in the normal plane N_p such that F vec(Q_i) = F vec(Q̂) + v_i.

¹ Since the normal plane also has an additional component, orthogonal to this space, one might think that N_p can intersect F in a point not in this space. That is, we would get some Q_5 belonging to normal directions orthogonal to this space. This is not the case, since there are only 2^n intersections; see Section 4.


If m = n, then an OPP has two minima, Q̃_+ and Q̃_−, with different determinant signs. If additionally A = X = I_m, then the normal plane at Q̃_+ intersects F at Q̃_−, and conversely, N_p at Q̃_− intersects F at Q̃_+. See Appendix C for details.

If B = 0, α_i > α_{i+1} for all i = 1, ..., m−1 and χ_j > χ_{j+1} for all j = 1, ..., n−1, then for a given minimizer Q̂ the normal plane intersects F exactly at every other minimum. In [18], it is shown that for B = 0 there exist 2^n minima. Any minimizer Q̂_i, i = 1, ..., 2^n, is on the form

Q̂_i = [ Z ; K ],   (7)

where Z ∈ R^{(m−n)×n} is a zero matrix and K ∈ R^{n×n} is an anti-diagonal matrix with arbitrary ±1 as elements.

For simplicity, let Q̂ = Q̂_1 be a minimizer with K = anti-diag(1, 1, ..., 1). The intersections of the normal plane and F can be found by deriving a symmetric matrix G ∈ R^{n×n}, yielding a normal A^{-1}Q̂_1GX^{-1}, such that

AQ̂_1X + A^{-1}Q̂_1GX^{-1} = AQ_iX ⇒ Q̂_1 + A^{-2}Q̂_1GX^{-2} = Q_i,

where Q_i has orthonormal columns. The first m − n rows of Q̂_1 are zero, so we partition as follows:

[ Z ; K ] + [ A_1^{-2}  0 ; 0  A_2^{-2} ] [ Z ; K ] GX^{-2} = [ Z ; K + A_2^{-2}KGX^{-2} ] = Q_i.

Evidently, the first m − n rows of any Q_i are also zero; hence the intersections are obtained by choosing G such that K + A_2^{-2}KGX^{-2} is orthogonal. Take G = diag(g_1, g_2, ..., g_n); then

K + A_2^{-2}KGX^{-2} = anti-diag(1, ..., 1) + anti-diag(g_1/(α_m^2 χ_1^2), g_2/(α_{m−1}^2 χ_2^2), ..., g_n/(α_{m−n+1}^2 χ_n^2)).

Taking different choices of G, as combinations of g_j = −2α_{m−j+1}^2 χ_j^2 or g_j = 0, results in 2^n points of intersection Q_i, i = 1, ..., 2^n. Each of these points is on the form given in (7), i.e., every Q_i is a minimizer.

The examples given here are admittedly simple and very special; nevertheless, they motivate an investigation of how well the usage of normal plane directions works for computing all minima of a general WOPP.

4 The Riccati normals

In this section, an algorithm that computes the intersections of N_p and F is presented. This is done by computing all real solutions to a continuous algebraic Riccati equation (CARE). The theory of solving a CARE is well known, e.g., see [9, 12].


As was done in the previous section, the set of intersections Q(Q̃) at a given point Q̃ can be derived by computing the different symmetric matrices G_i ∈ R^{n×n}, i = 1, ..., s, such that

AQ̃X + A^{-1}Q̃G_iX^{-1} = AQ_iX ⇒ Q̃ + A^{-2}Q̃G_iX^{-2} = Q_i,   (8)

where Q_i ∈ V_{m,n}. For now, assume there is a finite number s of intersections. In order to solve (8), the condition Q_i^TQ_i = I_n is applied; pre- and post-multiplying the resulting identity by X^2 yields the equation

C^TG_i + G_iC + G_iRG_i = 0,   (9)

where C = Q̃^TA^{-2}Q̃X^2 ∈ R^{n×n} and R = Q̃^TA^{-4}Q̃ ∈ R^{n×n}. R is symmetric (and positive definite), so (9) is a continuous algebraic Riccati equation. A symmetric solution G_i of (9) lets us compute a component in the normal space, N_i = A^{-1}Q̃G_iX^{-1} ∈ N_s, such that the normal plane at Q̃ intersects F at Q_i.

Definition 4.1 At a point Q̃, let N_i ∈ N_s yield an intersection of F and N_p according to

AQ̃X + N_i = AQ_iX, Q_i ∈ V_{m,n}.

With a ρ > 0 such that ρN_i is normalized, we call ρN_i a Riccati normal to F at Q̃.

Algorithms developed to solve (9), some mentioned in [11], mainly compute the stabilizing solution. A solution G_+ is said to be the maximal solution if (G_+ − G_i) is positive semi-definite for all solutions G_i. G_+ is unique and the eigenvalues² of C + RG_+ fulfill λ(C + RG_+) ≥ 0 [9]. If λ(C + RG_+) > 0, then G_+ is said to be stabilizing. As opposed to G_+, there exists a unique minimal solution G_− such that G_i − G_− is positive semi-definite for all solutions G_i, and λ(C + RG_−) ≤ 0.

² For a matrix H ∈ R^{m×m}, λ(H) denotes the set of eigenvalues λ_i, i = 1, ..., m, of H. By writing, e.g., λ(H) > 0 we mean that λ_i > 0 for all i = 1, ..., m.

4.1 Computing all symmetric solutions

In this section, we assume that the eigenvalues of C are distinct. Then there are 2^n symmetric solutions G_i = G_i^T to (9) [12]. We wish to compute them all. First construct the matrix

M = [ C  R ; 0  −C^T ].

Observe that if

[ C  R ; 0  −C^T ] [ I ; G_i ] = [ I ; G_i ] Z
for some matrix Z ∈ R^{n×n}, then

Z = C + RG_i,  −C^TG_i = G_iZ ⇒ C^TG_i + G_iC + G_iRG_i = 0.   (10)

The symmetric solutions are computed by using different subspace combinations of the eigenvectors of M. Let

M = VΛV^{-1} = [ V_{1,1}  V_{1,2} ; 0  V_{2,2} ] [ Λ_1  0 ; 0  Λ_2 ] [ V_{1,1}^{-1}  −V_{1,1}^{-1}V_{1,2}V_{2,2}^{-1} ; 0  V_{2,2}^{-1} ]

be a spectral decomposition of M. Due to the block upper-triangular form of M, the eigenvalues of M are λ(M) = {λ(C), −λ(C)}. Additionally, they are ordered such that the diagonal matrices Λ_1 and Λ_2 correspond to the eigenvalues λ(C) and −λ(C), respectively.

Lemma 4.1 The n eigenvalues of C ∈ R^{n×n} fulfill λ(C) > 0 (and then trivially −λ(C) < 0).

Proof. Observe that Q̃^TA^{-2}Q̃ is positive definite, and let L^TL = Q̃^TA^{-2}Q̃ be a Cholesky factorization. Since X is a diagonal matrix, the eigenvalues of C = L^TLX^2 are real and strictly positive:

λ(C) = λ(L^TLX^2) = λ(L^{-T}L^TLX^2L^T) = λ(LX^2L^T) > 0,

since LX^2L^T is symmetric and positive definite. □

The above lemma implies that the maximal solution of our problem (9) is always G_+ = 0, because λ(C + RG_+) = λ(C) > 0 if G_+ = 0.

Definition 4.2 Let Γ be a maximal set of eigenvalues of M such that λ_i ∈ Γ implies −λ_i ∉ Γ. Γ will contain n eigenvalues, since λ(M) = {λ(C), −λ(C)}. Additionally, 2^n different sets Γ_j, j = 1, ..., 2^n, can be constructed.

By using the eigenvectors of M corresponding to a set of eigenvalues Γ, we can compute a symmetric solution. Let Ṽ = [ṽ_1, ..., ṽ_n] ∈ R^{2n×n} be the eigenvectors corresponding to a set Γ, and make a partition on the form

Ṽ = [ Ṽ_1 ; Ṽ_2 ],  Ṽ_i ∈ R^{n×n}, i = 1, 2.

Then there exists a unique symmetric solution G_i such that the eigenvalues λ(C + RG_i) are exactly the set Γ, and G_i = Ṽ_2Ṽ_1^{-1} [9, 12].

For instance, the minimal solution G_−, which fulfills λ(C + RG_−) < 0, is computed by using the eigenvectors corresponding to the negative eigenvalues of M. Then we must pick Γ = λ(Λ_2) = −λ(C). The corresponding eigenvectors yield

G_− = V_{2,2}V_{1,2}^{-1}.   (11)


To see that (11) corresponds to the negative eigenvalues of M, use the spectral decomposition to get

C + RG_− = V_{1,1}Λ_1V_{1,1}^{-1} + (−V_{1,1}Λ_1V_{1,1}^{-1}V_{1,2}V_{2,2}^{-1} + V_{1,2}Λ_2V_{2,2}^{-1})G_− = V_{1,2}Λ_2V_{1,2}^{-1} = Z
⇒ λ(C + RG_−) = λ(Λ_2) = −λ(C) < 0.

Finally, using Z = V_{1,2}Λ_2V_{1,2}^{-1} yields a solution to the CARE, since from (10) we get

−C^TG_− = V_{2,2}Λ_2V_{2,2}^{-1}G_− = V_{2,2}Λ_2V_{1,2}^{-1} = G_−Z.

Now we can conclude that (11) is the minimal solution. The same can be done for the other symmetric solutions, resulting in a very simple algorithm. For deeper theory and understanding we refer to [9, 12].

Algorithm to compute the set Q.

1. Input: Q̃, A and X.
2. Q = ∅.
3. C = Q̃^TA^{-2}Q̃X^2 ∈ R^{n×n} and R = Q̃^TA^{-4}Q̃ ∈ R^{n×n}.
4. M = [ C  R ; 0  −C^T ].
5. [V, Λ] = eig(M).
6. for j = 1 to 2^n
   6.1. Let Ṽ = [ Ṽ_1 ; Ṽ_2 ] be the collection of eigenvectors corresponding to a set Γ_j according to Definition 4.2.
   6.2. G_j = Ṽ_2Ṽ_1^{-1}.
   6.3. Q_j = Q̃ + A^{-2}Q̃G_jX^{-2}.
   6.4. Q = {Q, Q_j}.
7. end for.

This algorithm works very well for computing the solutions to (9) if C has distinct eigenvalues. For some special setups of A, X and Q̃ that yield a C with multiple eigenvalues, there will be a continuum of solutions. Then the sets of eigenvalues Γ_j according to Definition 4.2 cannot contain n eigenvalues, and the eigenvectors of M are not uniquely defined. In these cases, 2^n solutions can be computed in another way; the method is described in Appendix D. It uses the maximal and minimal solutions G_+ and G_− and the eigenvectors of C to perform oblique projections, resulting in additional symmetric solutions.
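A NumPy sketch of the algorithm above, for the generic case of distinct eigenvalues, is given next. The function name, the eigenvalue pairing by sorting (each λ_i of C against −λ_i), and the tolerance in the check are our own devices; the paper specifies the selection of eigenvectors only through Definition 4.2.

    import numpy as np
    from itertools import product

    def intersections(Qt, A, X):
        # Compute the 2^n normal plane intersections Q(Q~) of Section 4.1,
        # assuming A, X diagonal (canonical form) and distinct eigenvalues of C.
        m, n = Qt.shape
        Ai2 = np.linalg.matrix_power(A, -2)
        C = Qt.T @ Ai2 @ Qt @ X @ X                  # C = Q~^T A^-2 Q~ X^2
        R = Qt.T @ Ai2 @ Ai2 @ Qt                    # R = Q~^T A^-4 Q~
        M = np.block([[C, R], [np.zeros((n, n)), -C.T]])
        lam, V = np.linalg.eig(M)
        # sorted by real part: idx[i] holds -lambda_i, idx[2n-1-i] holds +lambda_i
        idx = np.argsort(lam.real)
        out = []
        for bits in product([0, 1], repeat=n):       # the 2^n sets Gamma_j
            cols = [idx[2 * n - 1 - i] if b else idx[i] for i, b in enumerate(bits)]
            Vt = V[:, cols]
            G = np.real(Vt[n:] @ np.linalg.inv(Vt[:n]))   # G_j = V~_2 V~_1^{-1}
            out.append(Qt + Ai2 @ Qt @ G @ np.linalg.matrix_power(X, -2))
        return out

    # sanity check: all intersections lie on V_{m,n}; the all-positive set
    # Gamma gives G = 0 and reproduces Q~ itself
    rng = np.random.default_rng(4)
    m, n = 5, 3
    A = np.diag(np.sort(rng.uniform(1.0, 10.0, m))[::-1])
    X = np.diag(np.sort(rng.uniform(1.0, 10.0, n))[::-1])
    Qt = np.linalg.qr(rng.standard_normal((m, n)))[0]
    for Qj in intersections(Qt, A, X):
        assert np.allclose(Qj.T @ Qj, np.eye(n), atol=1e-6)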


5 Computational experiments

In this section, some results regarding the efficiency and reliability of the normal plane algorithm for computing all minimizers of a WOPP are presented.

5.1 Generating test problems

The matrices A and X are randomly generated diagonal matrices with α_i ≥ α_{i+1} > 0 and χ_i ≥ χ_{i+1} > 0, with different condition numbers κ(A) and κ(X). An exact solution Q̂ ∈ R^{m×n} is randomly generated to get the exact B̂ = AQ̂X. Several types of perturbations have been considered. The results presented are based on a perturbation

B = (B̂ + B̃)ε,   (12)

where B̃ ∈ R^{m×n} and ε > 0 is a scalar. B̃ is chosen to lie in the normal space N_s at Q̂. The magnitude of B̃ was chosen as ||B̃||_F ≈ 0.15||B̂||_F.

We consider the scale factor ε with 0 ≤ ε ≤ 1. It has been noted that for ε < 1, the generated WOPPs more often have several minima. In Appendix E.3, some results for ε = 1 are presented.

In order to know whether the normal plane algorithm finds all minima, or misses some, the following heuristic global optimization method was used.

Heuristic global optimization algorithm

0. k := number of minima found by the normal plane algorithm.
1. Fails := 0. Counts how many minima the normal plane method failed to compute.
2. i := 0. Used to count the number of minima found.
3. c_k := 0. Used to count how many Q_0 were needed before finding k minima.
4. while i < k
   4.1. Q_0 := random matrix with orthonormal columns.
   4.2. Q̂ := computed minimum, with Q_0 as initial matrix for a nonlinear solver.
   4.3. If Q̂ is a new minimum
      4.3.1. Save Q̂.
      4.3.2. i := i + 1.
   4.4. end if.
   4.5. c_k := c_k + 1.
4.6. end while.
5. Now this method has found the same number of minima as the normal plane algorithm, after using c_k different initial matrices.
6. We continue to search for additional minima. Let c_e be the number of extra random matrices to be used, e.g., c_e := max(100, min(10·c_k, 10000)).
7. j := 0.
8. while j < c_e
   8.1. Q_0 := random matrix with orthonormal columns.
   8.2. Q̂ := computed minimum, with Q_0 as initial matrix for a nonlinear solver.
   8.3. If Q̂ is a new minimum
      8.3.1. Fails := Fails + 1. A new minimum is found, i.e., the normal plane algorithm failed to compute them all.
   8.4. end if.
   8.5. j := j + 1.
9. end while.
10. Check that the saved minima are the same ones as found by the normal plane algorithm.

If the normal plane algorithm finds k minima, this heuristic algorithm generates different random matrices Q_0 and uses them as initial values for a nonlinear solver to compute a minimum. c_k is the number of matrices Q_0 that were needed to find k minima. Having found k minima, the heuristic method continues to generate c_e additional initial matrices. This is done to check for more minima, in case the normal plane algorithm failed to find them all.
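Two details that the listing leaves implicit are how Q_0 is drawn and when a computed Q̂ counts as new. A common device for both is sketched below; the QR orthonormalization of a Gaussian matrix and the Frobenius-distance tolerance are our own choices, not specified by the paper.

    import numpy as np

    def random_stiefel(m, n, rng):
        # Steps 4.1/8.1: a random matrix with orthonormal columns,
        # obtained from the QR factorization of a Gaussian matrix.
        return np.linalg.qr(rng.standard_normal((m, n)))[0]

    def is_new_minimum(Q, found, tol=1e-6):
        # Steps 4.3/8.3: Q counts as new if it differs from every
        # saved minimum by more than tol in the Frobenius norm.
        return all(np.linalg.norm(Q - P) > tol for P in found)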


5.2 Tables

Each table in Appendix E is a collection of results for different test problems of a specific dimension. The tables display the following information.

No. Q̂_i : The number of minima found by the normal plane algorithm.
c_k, c_e : The number of random matrices Q_0 used by the heuristic method described above.
Fails : The number of additional minima found by the heuristic method.
κ(A) : Condition number of A.
κ(X) : Condition number of X.
η : Relative residual norm, η = ||AQ̂X − B||_F / ||AQ̂X||_F, where Q̂ is the global minimum; used to present the residual as a relative distance.
ε : Scale factor used when generating a test problem (12).

As an example, consider the sixth row of Table 4 in Appendix E. For this test problem, the normal plane algorithm found 14 minima. The heuristic method found 14 minima after using 29 randomly generated initial matrices Q_0, then continued with 290 extra initial matrices and found two more minima, shown under Fails. Hence the normal plane method failed to compute 2 minima.

6 Conclusions

The computational experiments reported in Appendix E indicate that the normal plane algorithm manages the task of finding all, or at least several, minima of a WOPP. But failures can indeed occur. Even though a Q_i ∈ Q may be in the vicinity of an additional minimizer Q̂_i, convergence towards a different minimum Q̂_k, k ≠ i, can happen. The choice of step lengths, the type of nonlinear solver used when solving (1), the type of parametrization of Q, etcetera, can play a role.


Why the normal plane algorithm occasionally fails to compute all minima is not fully understood at the moment and is an issue for further studies. The results in Table 7 are quite bad compared to the other results. As n, and with it the maximal number of minima, increases, more failures seem to occur.

When computing the solutions G_i to the Riccati equation, note that the matrices C and R involved contain powers of A^{-1} and X. Hence, even for moderate condition numbers κ(A) and κ(X), the solutions G_i can be somewhat perturbed due to finite floating point precision. What should be considered a large or small perturbation here is uncertain. These computational errors do not, however, seem to be the reason for the failures of the normal plane algorithm. More likely, the normal plane intersections do not occur close enough to the desired minimum for the particular problem.

Since the number of normal plane intersections Q_i grows as 2^n, it can be a time consuming task to use all intersections for larger problems. In these cases, unless it is known that the problem has several minima and not just a few, a heuristic global optimization algorithm could be a better device. For example, if a WOPP with n = 8 has only, say, 10 minima, using all 256 intersections (255, after one minimum is computed) is indeed an exaggeration. Nevertheless, for WOPPs with dimensions around n = 1, 2, ..., 5, the normal plane algorithm has shown to be efficient in any case.

A final observation made during the computational experiments is that the maximal number of (unconnected) minima a WOPP can have seems to comply with the formula 2^n. Though this is not studied in detail here, more than 2^n minima are yet to be found. In Appendix E.3, where ε = 1 is used, the generated test problems seldom had 2^n minima. If B is small compared to the surface of AQX, it seems more likely that the WOPP will have a near-maximal (or at least large) number of minima.

Appendix

A The canonical form of a WOPP

Proposition A.1 The matrices A ∈ R^{m_A×m} and X ∈ R^{n×n_X} with rank(A) = m and rank(X) = n belonging to a WOPP

min_Q (1/2)||AQX − B||_F^2, subject to Q^T Q = I_n,

can always be considered as m-by-m and n-by-n diagonal matrices, respectively.

Proof. Let A = U_A Σ_A V_A^T and X = U_X Σ_X V_X^T be the singular value decompositions of A and X. Then

||U_A Σ_A V_A^T Q U_X Σ_X V_X^T − B||_F^2 = ||U_A Σ_A Z Σ_X V_X^T − B||_F^2,

where Z = V_A^T Q U_X ∈ R^{m×n} has orthonormal columns. Since U_A^T U_A = I_{m_A} and V_X^T V_X = I_{n_X}, it follows that

||U_A Σ_A Z Σ_X V_X^T − B||_F^2 = tr(U_A Σ_A Z Σ_X V_X^T − B)^T (U_A Σ_A Z Σ_X V_X^T − B)
= tr(V_X Σ_X Z^T Σ_A^2 Z Σ_X V_X^T − 2 V_X Σ_X Z^T Σ_A U_A^T B + B^T B)
= tr(Σ_X Z^T Σ_A^2 Z Σ_X) − 2 tr(Σ_X Z^T Σ_A U_A^T B V_X) + tr(B^T B)
= tr(Σ_A Z Σ_X − U_A^T B V_X)^T (Σ_A Z Σ_X − U_A^T B V_X) = ||Σ_A Z Σ_X − U_A^T B V_X||_F^2.

Hence, without loss of generality, we can assume that A = diag(α_1, ..., α_m) and X = diag(χ_1, ..., χ_n) with α_i ≥ α_{i+1} ≥ 0 and χ_i ≥ χ_{i+1} ≥ 0. □
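For square full-rank A and X, the reduction in the proof amounts to a pair of SVDs; a sketch with our own names follows. NumPy returns the singular values in decreasing order, so the orderings α_i ≥ α_{i+1} and χ_i ≥ χ_{i+1} hold automatically.

    import numpy as np

    def to_canonical(A, X, B):
        # Proposition A.1: replace (A, X, B) by (Sigma_A, Sigma_X, U_A^T B V_X).
        # A solution Z of the canonical problem maps back via Q = V_A Z U_X^T.
        UA, sA, VAt = np.linalg.svd(A)
        UX, sX, VXt = np.linalg.svd(X)
        return np.diag(sA), np.diag(sX), UA.T @ B @ VXt.T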


B Normal plane intersections for Q ∈ R^{2×2}

The difference between balanced cases (m = n) and unbalanced cases (n < m) is that when m = n, the determinant sign of Q (either det(Q) = 1 or det(Q) = −1) splits F into two disjoint parts, F_+ and F_−. When Q ∈ R^{2×2} we can use the representations

Q_+(φ) = [ cos(φ)  −sin(φ) ; sin(φ)  cos(φ) ]   (13)

for all Q with det(Q) = 1, and

Q_−(φ) = [ cos(φ)  sin(φ) ; sin(φ)  −cos(φ) ]   (14)

for all Q with det(Q) = −1. Using (13) we can write

F vec(Q_+) = F([1, 0, 0, 1]^T cos(φ) + [0, 1, −1, 0]^T sin(φ)).

Note that F_+ is an ellipse in R^4. Project b onto the plane determined by the ellipse. This results in the previously mentioned problem of finding the point on an ellipse in R^2 that lies closest to the projected component of b. Then, for a minimizer Q_{1,*}, there is a possibility of another minimum in the vicinity of where the normal plane intersects the ellipse. The same reasoning can be applied to Q_−.

There are either 2^n intersections or a continuum of intersections. For a given Q̃ ∈ R^{2×2} with, e.g., det(Q̃) = +1, there are trivially always two intersections of the normal plane at Q̃ with F_+ (just as for an ellipse in R^2). Hence the other intersections must occur with F_−. If there is a continuum of intersections, then the normal plane coincides with the plane determined by the ellipse F_−. This occurs, for instance, if A = X = I_2, which corresponds to a simple form of an OPP.

C Normal plane intersections for a simple OPP

Assume Q̃_+ is a solution to

min_Q ||Q − B||_F^2, subject to Q ∈ V_{m,n},

with det(Q̃_+) = 1. The solution is given by the SVD of B^T = UΣV^T as Q̃_+ = VU^T [7]. According to [15], the minimum Q̃_− with opposite determinant sign, det(Q̃_−) = −1, is given by

Q̃_− = VĨ_mU^T,
where Ĩ_m = diag(1, ..., 1, −1) ∈ R^{m×m}. If the intersection of the normal plane at Q̃_+ with F occurs at Q̃_−, then there must exist a symmetric G ∈ R^{m×m} such that

Q̃_+ + Q̃_+G = Q̃_−.

We can write this as

VU^T + VU^TG = VĨ_mU^T ⇒ G = U(Ĩ_m − I_m)U^T,

which shows that G is indeed symmetric. Hence N_p at Q̃_+ intersects F at Q̃_−.

D CARE: Multiple eigenvalues

The continuous algebraic Riccati equation (CARE) is

C^TG_i + G_iC + G_iRG_i = 0,   (15)

where C = Q̃^TA^{-2}Q̃X^2 ∈ R^{n×n} and R = Q̃^TA^{-4}Q̃ ∈ R^{n×n}. When computing all 2^n solutions to (15) in [17], the matrix

M = [ C  R ; 0  −C^T ]

is used. The CARE (15) has only a finite number of solutions, 2^n, if the eigenvalues of M are distinct.

Here we consider the special case when C (and then M) has multiple eigenvalues, yielding a continuum of solutions. We only wish to compute 2^n solutions, and the method described in Section 4.1 can fail in these cases.

The following theorem is a modified version of Theorem 7.5.4 in [9, page 174], adapted to suit our problem with G_+ = 0. In this theorem, X_+ denotes the spectral subspace of (C + DG_+) = C that is complementary to Null(G_+ − G_−) = Null(−G_−), where Null(·) denotes the null space.

Theorem D.1 Suppose R is positive semi-definite, rank(R) = n and that (15) has at least one solution. Let H be a C-invariant³ subspace with H ⊆ X_+ and define

Y = ((−G_−)H)^⊥.⁴

Then Y is C-invariant, and if P is the projection onto H along Y, then

G = G_−(I − P)   (16)

is a solution of the CARE (15).

³ In [9] a subspace S ⊆ R^n is called invariant for the matrix C ∈ R^{n×n} (or C-invariant) if Cx ∈ S for every x ∈ S.
⁴ Here ⊥ denotes the orthogonal complement of the space.


Having computed G_−, and by choosing a subspace H_i ⊆ X_+ that is C-invariant, a solution G_i can be computed from (16). Corollary 7.6.5 in [9, page 182] states:

Corollary D.1 Let G_+ and G_− be the maximal and minimal solutions of the CARE (15), respectively. Then G_+ − G_− is invertible if and only if M has no purely imaginary or zero eigenvalues.

Since G_+ = 0 and the eigenvalues of M are real and nonzero, G_− is invertible. Hence Null(−G_−) = {0} and X_+ = R^n, so any C-invariant subspace can be used. The subspaces H_i are constructed by combining the eigenvectors of C to form 2^n different subspaces. The algorithm to compute the solutions G_i, i = 1, ..., 2^n, works as follows.

Algorithm to compute the set Q if M has eigenvalues with multiplicity.

1. Input: Q̃, A and X.
2. Q = ∅.
3. Compute G_−.
4. [V, D] = eig(C).
5. Let H_i = V_i, i = 1, ..., n.
6. Let
   H = {H_1, ..., H_n, ..., H_{2^n}}   (17)
   be the set of all combinations of eigenvectors forming a subspace, i.e., H_j = [V_{k_1}, ..., V_{k_p}] for n + 1 ≤ j ≤ 2^n, where 1 ≤ k_1 < k_2 < ... < k_p ≤ n.
7. for j = 1 to 2^n
   7.1. O = (−G_−H_j).
   7.2. P = H_j(O^TH_j)^+O^T.
   7.3. G_j = G_−(I − P).
   7.4. Q_j = Q̃ + A^{-2}Q̃G_jX^{-2}.
   7.5. Q := {Q, Q_j}.
8. end for.

This method has shown to work well in the special case when M has eigenvalues with multiplicity. The set of eigenvectors of C used by the algorithm is finite, hence the algorithm computes 2^n solutions. If, e.g., two eigenvalues of C are equal, λ_i = λ_{i+1}, then the corresponding two eigenvectors V_i and V_{i+1} are not uniquely defined. It is then possible to use an infinite number of different
eigenvector combinations, each of which, via the oblique projections, yields a solution to (15). For a continuum of solutions it could be expected that an optimal solution G_o can be derived according to

G_o = arg min_G ||AQ̃X + A^{-1}Q̃GX^{-1} − B||_F^2,

where G belongs to a continuum of solutions.

E Tables for the computational experiments

The tables in this section display the following information.

No. Q̂_i : The number of minima found by the normal plane algorithm.
c_k, c_e : The number of random matrices Q_0 used by the heuristic method.
Fails : The number of additional minima found by the heuristic method.
κ(A) : Condition number of A.
κ(X) : Condition number of X.
η : Relative residual norm, η = ||AQ̂X − B||_F / ||AQ̂X||_F, where Q̂ is the global minimum.
ε : Scale factor used when generating a test problem (12).

E.1 Problems of dimension n < 7

Here each table is a sample of data picked from a set of 200 generated test problems. All failures (if any) of the normal plane algorithm to compute all minima are presented. If a table shows no failures, there were none.


No. Q̂_i   c_k (c_e)   Fails   κ(A)     κ(X)     η      ε
2         3 (100)     0       109.53   21.08    0.06   0.57
2         4 (100)     0       60.82    21.3     0.32   0.4
2         3 (100)     0       14.7     66.47    0.24   0.47
2         2 (100)     0       17.28    100.08   0.55   0.13
2         2 (100)     0       40.83    63.55    0.09   0.5
2         4 (100)     0       73.73    30.51    0.14   0.45
2         6 (100)     0       25.02    27.9     0.15   0.28
2         2 (100)     0       76.31    57.2     0.19   0.1
3         7 (100)     0       41.89    15.7     0.05   0.64
3         5 (100)     0       50.5     49.83    0.14   0.59
3         8 (100)     0       101.79   42.38    0.14   0.42
3         10 (100)    0       22.44    13.37    0.13   0.63
4         6 (100)     0       33.97    26.65    0.52   0.3
4         14 (140)    0       85.26    25.43    0.14   0.12
4         10 (100)    0       87.84    39.89    0.13   0.38
4         5 (100)     0       70.05    13.31    0.11   0.59
4         5 (100)     0       46.88    17.85    0.14   0.35
4         5 (100)     0       98.17    37.14    0.13   0.18
4         7 (100)     0       62.83    48.44    0.11   0.46
4         8 (100)     0       109.21   34.53    0.15   0.39
4         9 (100)     0       108.5    9.29     0.18   0.58
4         7 (100)     0       11.08    24.24    0.7    0.24
4         4 (100)     0       62.37    23.68    0.64   0.06
4         9 (100)     0       72.7     14.61    0.09   0.47
4         7 (100)     0       23.99    16.48    0.26   0.29

Table 1: Results for problems of dimension Q ∈ R^{4×2}. The maximal number of minima here is 2^2 = 4.


No. Q̂_i   c_k (c_e)   Fails   κ(A)     κ(X)     η      ε
2         7 (100)     0       82.68    29.17    0.18   0.59
2         2 (100)     0       82.26    41.23    0.12   0.54
2         8 (100)     0       41.72    106.33   0.24   0.48
2         2 (100)     0       54.63    71.95    0.24   0.58
2         2 (100)     0       62.37    23.68    0.6    0.39
2         3 (100)     0       16.24    65.08    0.41   0.6
4         9 (100)     0       26.92    11.93    0.59   0.27
4         17 (170)    0       78.21    29.91    0.57   0.34
4         6 (100)     0       83.37    39.69    0.39   0.62
4         6 (100)     0       44.99    32.82    0.31   0.65
4         6 (100)     0       22.55    33.87    0.2    0.46
4         7 (100)     0       50.15    69.4     0.6    0.32
8         32 (320)    0       38.17    64.88    0.49   0.12
8         18 (180)    0       106.61   25.01    0.82   0.07
8         31 (310)    0       70.81    33.48    0.6    0.24
8         55 (550)    0       17.76    23.1     0.5    0.42
8         15 (150)    0       29.84    85.68    0.4    0.13
8         30 (300)    0       81.13    9.94     0.87   0.19
8         42 (420)    0       35.01    45.89    0.36   0.4
8         11 (110)    0       40.17    48.39    0.7    0.08
8         22 (220)    0       42.75    64.4     0.51   0.22
8         28 (280)    0       96.62    34.63    0.13   0.43
8         36 (360)    0       101.79   42.38    0.73   0.11
8         13 (130)    0       12.11    41.02    0.93   0.06
8         25 (250)    0       14.7     66.47    0.72   0.2

Table 2: Results for problems of dimension Q ∈ R^{4×3}. The maximal number of minima here is 2^3 = 8.


No. Q̂_i   c_k (c_e)   Fails   κ(A)     κ(X)    η      ε
2         3 (100)     0       76.68    10.45   0.37   0.36
2         2 (100)     0       50.53    38.86   0.49   0.28
2         5 (100)     0       59.79    15.74   0.28   0.59
2         2 (100)     0       67.53    49.68   0.61   0.27
2         6 (100)     0       70.12    64.24   0.24   0.36
2         2 (100)     0       23.99    82.47   0.37   0.5
4         6 (100)     0       54.97    32.61   0.3    0.43
4         11 (110)    0       57.55    16.21   0.36   0.52
4         16 (160)    0       60.13    12.59   0.2    0.61
4         25 (250)    0       62.72    67.7    0.69   0.11
4         7 (100)     0       14.01    36.58   0.84   0.15
4         4 (100)     0       16.59    18.79   0.4    0.25
4         5 (100)     0       67.88    55.92   0.31   0.29
4         11 (110)    0       19.17    16.51   0.54   0.34
8         15 (150)    0       55.32    35.32   0.19   0.45
8         14 (140)    0       11.77    33.6    0.93   0.09
8         13 (130)    0       14.35    41.01   0.63   0.18
8         14 (140)    0       65.64    50.18   0.36   0.23
8         24 (240)    0       16.93    16.98   0.4    0.27
8         33 (330)    0       32.42    23.58   0.62   0.24
8         31 (310)    0       35.01    23.95   0.51   0.33
8         33 (330)    0       79.98    6.61    0.91   0.05
8         64 (640)    0       37.59    9.52    0.25   0.42
8         23 (230)    0       96.62    12      0.4    0.15
8         19 (190)    0       23.09    63.54   0.63   0.16

Table 3: Results for problems of dimension Q ∈ R^{5×3}. The maximal number of minima here is 2^3 = 8.


No. Q̂_i   c_k (c_e)    Fails   κ(A)     κ(X)    η      ɛ
8          48 (480)     0       99.88    25.77   0.58   0.33
8          18 (180)     0       60.82    39.81   0.22   0.37
8          86 (860)     0       19.86    42.41   0.25   0.51
10         48 (480)     0       86.3     13.59   0.25   0.58
12         38 (380)     0       11.94    32.31   0.71   0.14
14         29 (290)     2       70.4     19.13   0.83   0.09
14         33 (330)     0       14.35    9.62    0.62   0.24
16         39 (390)     0       87.84    54.98   0.97   0.06
16         49 (490)     0       39.14    27.38   0.9    0.08
16         39 (390)     0       41.72    50.79   0.79   0.12
16         42 (420)     0       93.01    29.37   0.26   0.14
16         49 (490)     0       39.41    42.24   0.87   0.1
16         100 (1000)   0       73.92    39.23   0.7    0.11
16         73 (730)     0       34.32    13.03   0.52   0.19
16         95 (950)     0       42.93    35.06   0.42   0.12
16         41 (410)     0       36.9     37.67   0.48   0.23
16         32 (320)     0       65.3     17.45   0.93   0.07
16         52 (520)     0       70.46    64.98   0.73   0.15
16         50 (500)     0       21.75    43.53   0.66   0.17
16         55 (550)     0       73.04    54.96   0.68   0.19
16         37 (370)     0       84.47    55.31   0.71   0.17
16         88 (880)     0       57.4     43.91   0.65   0.26
16         34 (340)     0       65.64    51.95   0.42   0.26
16         38 (380)     0       18.56    47.05   0.87   0.11
16         60 (600)     0       45.33    12.85   0.92   0.12

Table 4: Results for problems of dimension Q ∈ R^{6×4}. The maximal number of minima here is 2⁴ = 16.


No. Q̂_i   c_k (c_e)    Fails   κ(A)     κ(X)    η      ɛ
16         125 (1250)   0       37.24    25.22   0.51   0.42
16         163 (1630)   0       106.61   46.83   0.92   0.12
16         83 (830)     0       11.77    33.6    0.85   0.2
16         125 (1250)   0       55.66    14.58   0.75   0.28
16         64 (640)     0       60.82    69.54   0.49   0.37
16         56 (560)     0       68.57    34.35   0.39   0.49
20         330 (3300)   2       57.9     13.77   0.89   0.14
24         116 (1160)   0       80.44    9.1     0.9    0.13
24         115 (1150)   0       63.83    35.26   0.79   0.19
24         139 (1390)   0       67.88    55.92   0.92   0.11
24         230 (2300)   0       45.33    86.94   0.91   0.12
24         126 (1260)   0       50.5     28.74   0.79   0.2
26         169 (1690)   0       11.08    58.97   0.44   0.42
26         102 (1020)   2       14.35    41.01   0.73   0.24
32         484 (4840)   0       39.14    26.76   0.91   0.08
32         322 (3220)   0       90.43    40.61   0.9    0.1
32         169 (1690)   0       93.01    49.39   0.78   0.14
32         191 (1910)   0       31.73    40.69   0.86   0.15
32         269 (2690)   0       36.9     25.31   0.72   0.23
32         221 (2210)   0       16.59    18.79   0.94   0.09
32         175 (1750)   0       21.75    71.83   0.84   0.17
32         177 (1770)   0       92.32    18.73   0.51   0.33
32         241 (2410)   0       55.32    35.32   0.92   0.1
32         237 (2370)   0       65.64    50.18   0.62   0.26
32         192 (1920)   0       42.75    61.23   0.92   0.08

Table 5: Results for problems of dimension Q ∈ R^{5×5}. The maximal number of minima here is 2⁵ = 32.


No. Q̂_i   c_k (c_e)    Fails   κ(A)     κ(X)    η      ɛ
12         51 (510)     0       46.91    18.67   0.92   0.09
12         36 (360)     0       63.06    28.2    0.46   0.29
12         43 (430)     0       37.59    83.33   0.72   0.15
14         40 (400)     0       47.79    63.99   0.48   0.24
16         74 (740)     0       108.75   23.67   0.69   0.26
16         69 (690)     0       98.86    21.3    0.21   0.54
16         102 (1020)   0       101.44   35.15   0.71   0.11
16         106 (1060)   0       51.88    13.38   0.39   0.24
16         57 (570)     0       42.8     25.65   0.42   0.44
16         41 (410)     0       63.41    22.8    0.85   0.08
17         55 (550)     1       70.5     22.98   0.77   0.23
24         68 (680)     0       103.34   41.14   0.83   0.1
24         216 (2160)   0       39.48    98      0.62   0.14
24         222 (2220)   0       78.55    13.83   0.73   0.14
26         259 (2590)   0       29.84    15.46   0.47   0.22
28         132 (1320)   0       34.66    30.22   0.75   0.18
31         98 (980)     1       85.26    39.33   0.85   0.08
32         113 (1130)   0       62.37    14.26   0.82   0.11
32         130 (1300)   0       64.95    17.1    0.4    0.28
32         150 (1500)   0       29.07    37.4    0.96   0.06
32         97 (970)     0       83.37    32.68   0.84   0.09
32         91 (910)     0       86.3     43.08   0.96   0.06
32         138 (1380)   0       45.33    17.77   0.96   0.07
32         120 (1200)   0       96.62    18.63   0.67   0.15
32         106 (1060)   0       22.44    99.24   0.71   0.09

Table 6: Results for problems of dimension Q ∈ R^{8×5}. The maximal number of minima here is 2⁵ = 32.


No. Q̂_i   c_k (c_e)    Fails   κ(A)     κ(X)     η      ɛ
32         135 (1350)   0       24.63    131.33   0.94   0.07
32         296 (2960)   0       197.86   15.14    0.84   0.13
32         185 (1850)   0       40.17    104.04   0.71   0.24
32         128 (1280)   0       98.91    41.86    0.86   0.16
32         94 (940)     0       127.51   14.02    0.32   0.33
32         51 (510)     12      53.19    84.35    0.87   0.11
34         116 (1160)   2       169.03   26.56    0.96   0.07
38         890 (8900)   0       159.87   30.84    0.95   0.08
40         201 (2010)   0       172.64   20.7     0.87   0.11
40         252 (2520)   0       106.45   9.17     0.81   0.13
40         242 (2420)   0       37.33    94.06    0.86   0.12
41         209 (2090)   4       28.06    163.91   0.91   0.1
45         133 (1330)   3       99.66    205.94   0.77   0.1
45         177 (1770)   3       66.93    40.39    0.81   0.19
46         345 (3450)   2       140.28   20.2     0.83   0.14
47         286 (2860)   1       139.59   19.26    0.82   0.14
48         343 (3430)   0       195.33   22.71    0.89   0.08
48         375 (3750)   0       93.44    18.59    0.87   0.1
50         271 (2710)   1       53.65    59.95    0.88   0.11
57         186 (1860)   6       182.28   50.69    0.91   0.1
60         218 (2180)   4       181.59   59.17    0.92   0.1
64         592 (5920)   0       159.25   24.29    0.93   0.07
64         364 (3640)   0       145.44   51.74    0.8    0.1
64         181 (1810)   0       39.73    79.06    1      0.01
64         224 (2240)   0       41.8     87.13    0.95   0.07

Table 7: Results for problems of dimension Q ∈ R^{8×6}. The maximal number of minima here is 2⁶ = 64.


E.2 Higher dimensional problems

Since the maximal number of minima of a WOPP seems to grow as 2^n, computing them all is more time consuming for large values of n. Generating reasonable test problems, with a not-too-large residual, that have a maximal or near-maximal number of minima is difficult. As seen in the tables below, the relative residual norm η is close to 1 and ɛ is very small; here B is quite close to the origin, i.e., B ≈ 0. This results in minima that are almost of the anti-diagonal form stated in (7).

No. Q̂_i   c_k (c_e)     Fails   κ(A)     κ(X)    η      ɛ
92         675 (6750)    4       84.1     18.17   0.96   0.07
96         745 (7450)    0       216.29   20.41   0.99   0.03
96         761 (7610)    0       43.49    3.35    0.99   0.03
120        1193 (11930)  0       52.43    40.81   0.93   0.05
128        2102 (21020)  0       93.05    11.89   0.96   0.09
128        1764 (17640)  0       98.22    41.05   0.99   0.01
128        1223 (12230)  0       134.36   5.46    0.98   0.04
128        1957 (19570)  0       211.81   7.69    0.99   0.01

Table 8: Results for problems of dimension Q ∈ R^{9×7}. The maximal number of minima here is 2⁷ = 128.

No. Q̂_i   c_k (c_e)     Fails   κ(A)     κ(X)    η      ɛ
160        645 (6450)    0       73.78    27.42   0.96   0.05
192        2919 (20000)  0       200.8    16.58   0.99   0.01
256        1121 (11210)  0       53.12    28.97   0.99   0.01
256        1432 (14320)  0       133.88   8.52    0.99   0.01
256        2173 (20000)  0       58.29    22.53   0.99   0.02
256        1036 (19210)  0       68.61    34.77   0.98   0.04

Table 9: Results for problems of dimension Q ∈ R^{10×8}. The maximal number of minima here is 2⁸ = 256.

E.3 Results when using ɛ = 1

In the previous section, ɛ < 1 was used when generating B according to (12). This might seem rather unnatural. Here we present some results when using ɛ = 1, i.e., the perturbation generated is just B = B̂ + B̃ with B̂ and B̃ defined as before. Each table of data is a sample taken from a set of 500 tests. For a specific dimension, the tests with the largest number of minima are presented.


No. Q̂_i   c_k (c_e)    Fails   κ(A)     κ(X)     η      ɛ
...        ...          ...     ...      ...      ...    ...
4          9 (100)      0       36.43    107.42   0.13   1
4          4 (100)      0       51.92    16.53    0.1    1
5          13 (130)     0       35.38    46.45    0.14   1
5          30 (300)     0       112.84   28.83    0.13   1
5          26 (260)     0       100.8    25.12    0.13   1
5          9 (100)      0       46.76    105.81   0.14   1
6          8 (100)      0       64.3     71.19    0.14   1
6          776 (7760)   0       147.76   28.69    0.13   1
6          11 (110)     0       20.58    104.87   0.13   1
6          11 (110)     0       47.09    29.95    0.12   1
6          16 (160)     0       187.89   97.45    0.15   1
7          22 (220)     0       186.51   38.25    0.14   1
8          16 (160)     0       60.51    113.43   0.14   1
8          31 (310)     0       86.68    11.97    0.12   1
8          16 (160)     0       112.5    162.17   0.12   1

Table 10: Results for problems of dimension Q ∈ R^{4×3} and ɛ = 1.

No. Q̂_i   c_k (c_e)    Fails   κ(A)     κ(X)     η      ɛ
...        ...          ...     ...      ...      ...    ...
4          7 (100)      0       87.37    10.89    0.15   1
4          9 (100)      0       97.7     154.45   0.15   1
4          17 (170)     0       82.43    48.18    0.14   1
4          15 (150)     0       154.5    41.57    0.13   1
4          7 (100)      0       67.41    85.9     0.06   1
4          8 (100)      0       69.83    173.1    0.15   1
4          43 (430)     0       190.65   52.55    0.15   1
4          9 (100)      0       37.12    32.7     0.15   1
4          5 (100)      0       42.28    9.76     0.11   1
5          32 (320)     0       152.77   10.49    0.13   1
5          10 (100)     0       61.89    45.07    0.14   1
5          64 (640)     0       16.8     30.72    0.13   1
6          82 (820)     0       55.35    87.56    0.11   1
6          34 (340)     0       86.68    19.77    0.13   1
6          141 (1410)   0       200.97   20.89    0.15   1

Table 11: Results for problems of dimension Q ∈ R^{5×3} and ɛ = 1.


No. Q̂_i   c_k (c_e)    Fails   κ(A)     κ(X)     η      ɛ
...        ...          ...     ...      ...      ...    ...
4          13 (130)     0       150.03   20.14    0.15   1
5          6 (100)      0       177.21   84.88    0.15   1
5          19 (190)     0       121.79   24.07    0.13   1
5          17 (170)     0       112.84   112.91   0.15   1
5          17 (170)     0       114.22   160.16   0.14   1
5          7 (100)      0       63.2     112.53   0.15   1
5          14 (140)     0       38.83    105.08   0.15   1
5          30 (300)     0       46.07    17.8     0.15   1
5          101 (1010)   0       169.3    82.06    0.15   1
5          40 (400)     0       82.21    117.09   0.14   1
5          136 (1360)   0       103.56   60.21    0.14   1
6          45 (450)     0       28.84    91.69    0.15   1
7          30 (300)     0       92.18    157.71   0.15   1
7          59 (590)     0       23.34    54.46    0.15   1
7          29 (290)     0       173.09   85.22    0.14   1

Table 12: Results for problems of dimension Q ∈ R^{6×4} and ɛ = 1.

No. Q̂_i   c_k (c_e)    Fails   κ(A)     κ(X)     η      ɛ
...        ...          ...     ...      ...      ...    ...
10         41 (410)     0       184.44   23.8     0.15   1
10         45 (450)     0       57.42    43.57    0.15   1
10         32 (320)     0       109.74   168.85   0.14   1
10         12 (120)     0       193.05   93.12    0.15   1
10         148 (1480)   0       178.25   107.14   0.15   1
11         26 (260)     0       58.8     10.35    0.15   1
12         76 (760)     0       116.62   131.67   0.14   1
12         39 (390)     0       35.38    165.7    0.15   1
12         33 (330)     0       134.18   39.32    0.15   1
12         30 (300)     0       116.29   61.28    0.14   1
14         104 (1040)   0       179.28   54       0.15   1
14         57 (570)     0       32.98    107.63   0.12   1
15         144 (1440)   0       122.48   97.87    0.14   1
16         53 (530)     0       125.5    83.2     0.15   1
16         44 (440)     0       93.23    132.41   0.14   1

Table 13: Results for problems of dimension Q ∈ R^{5×5} and ɛ = 1.


No. Q̂_i   c_k (c_e)    Fails   κ(A)     κ(X)     η      ɛ
...        ...          ...     ...      ...      ...    ...
5          18 (180)     0       76       23.67    0.15   1
5          31 (310)     0       13.02    82.07    0.12   1
5          10 (100)     0       71.19    43.39    0.14   1
5          18 (180)     0       206.14   39.92    0.15   1
5          8 (100)      0       129.37   112.56   0.14   1
6          20 (200)     0       187.53   75.05    0.15   1
6          16 (160)     0       137.55   66.03    0.14   1
6          24 (240)     0       109.05   23.43    0.15   1
6          16 (160)     0       178.25   48.02    0.15   1
6          28 (280)     0       77.74    62.28    0.15   1
7          11 (110)     0       48.47    50.13    0.14   1
8          13 (130)     0       208.19   9.98     0.14   1
8          86 (860)     0       192.24   92.38    0.13   1
8          20 (200)     0       111.81   57.43    0.15   1
8          16 (160)     0       50.54    126.63   0.14   1

Table 14: Results for problems of dimension Q ∈ R^{8×5} and ɛ = 1.

No. Q̂_i   c_k (c_e)    Fails   κ(A)     κ(X)     η      ɛ
...        ...          ...     ...      ...      ...    ...
5          17 (170)     0       155.19   106.2    0.14   1
6          144 (1440)   0       39.86    222.47   0.14   1
6          29 (290)     0       143.13   18.87    0.11   1
6          27 (270)     0       118      60.25    0.15   1
6          18 (180)     0       124.55   141.27   0.14   1
6          20 (200)     0       176.18   34.95    0.15   1
6          19 (190)     0       104.58   13.23    0.14   1
6          66 (660)     0       125.92   86.99    0.13   1
6          12 (120)     0       109.08   15.17    0.14   1
6          144 (1440)   0       204.07   24.41    0.15   1
7          46 (460)     0       76       23.67    0.14   1
7          90 (900)     0       82.9     105.06   0.14   1
8          20 (200)     0       98.59    40.2     0.14   1
8          41 (410)     0       112.15   35.17    0.15   1
10         30 (300)     0       178.25   48.02    0.15   1

Table 15: Results for problems of dimension Q ∈ R^{8×6} and ɛ = 1.






Paper V

A Cubic Convergent Iteration Method

Thomas Viklands*
Department of Computing Science, Umeå University
SE-901 87 Umeå, Sweden.
viklands@cs.umu.se

Abstract

An iteration function x_{k+1} = Φ(x_k) for solving f(x) = 0, where f(x) ∈ R^m and x ∈ R^m, is presented. The iteration method yields cubic convergence and is based on a secant update using the second order derivatives of f(x). Normally, computing the second order derivatives of f(x) is a computationally heavy task. The special case when f(x) is a set of quadratic functions is studied; in this case the second order derivatives are constant, and a method using second order derivatives may then be preferable.

Keywords: Cubic, higher order, Halley, Chebyshev, quadratic equations, Hessian.

* Financial support has partly been provided by the Swedish Foundation for Strategic Research under the frame program grant A3 02:128.


Contents

1 Introduction 145
2 Secant update of J(x) 146
  2.1 Convergence analysis 147
3 Computational experiments 149
  3.1 Tests on some standard test functions 149
    3.1.1 Conclusions 152
  3.2 Tests on quadratic functions 152
    3.2.1 Conclusions 155
References 155


1 Introduction

Let f(x) ∈ R^m be a vector valued function of x ∈ R^m. This paper proposes an iteration function Φ(x) that uses the second order derivatives of f(x) to compute a solution of f(x) = 0. It is shown that the iteration x_{k+1} = Φ(x_k) exhibits cubic convergence for m = 1. A proof for a general dimension m ≥ 1 can be found in [7]; here it is left out due to the rather complicated representations of the higher order derivatives of Φ(x) for m > 1.

Methods that use higher order derivatives are seldom used, since they tend to be computationally heavy. However, if the second order derivatives of f(x) are constant, then the computational cost in each iterate decreases (for a method that uses second order information). Hence we make a special study of the case when f(x) is a set of quadratic equations.

Consider the Taylor expansion of f(x) = [f_1(x), ..., f_m(x)]^T around x ∈ R^m,

    f(x + p) = f(x) + J(x)p + ½[p^T ⊙ H(x)]p + O(||p||³),

where p ∈ R^m is a step, J(x) ∈ R^{m×m} is the Jacobian of f(x), and H(x) ∈ R^{m×m×m} is a third order tensor containing the second order derivatives of f(x). Here H(x) is represented such that H_i ∈ R^{m×m} is the Hessian of f_i(x), and the product ⊙ is defined as

    [p^T ⊙ H(x)] = [p^T H_1 ; p^T H_2 ; ... ; p^T H_m] ∈ R^{m×m},

i.e., row i of [p^T ⊙ H(x)] is p^T H_i. The search direction for Newton's method [2] to solve f(x) = 0 is based on the linear approximation

    f(x_k) + J(x_k)p_k = 0  ⇒  p_k = −J_k^{-1} f_k,

yielding the iteration formula

    x_{k+1} = x_k + p_k.    (1)

Newton's method is known to converge quadratically to a solution x̂ of f(x) = 0. Some classical methods that use second order information are Halley's method, also known as the method of tangent hyperbolas [6], and the Chebyshev method [4]. For m = 1, Halley's method is

    x_{k+1} = x_k − f(x_k) / [f'(x_k)(1 − ½ f'(x_k)^{-2} f''(x_k) f(x_k))],

and the Chebyshev method is

    x_{k+1} = x_k − (1 + ½ f'(x_k)^{-1} f''(x_k) f'(x_k)^{-1} f(x_k)) f'(x_k)^{-1} f(x_k).
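The three classical iterations above are easy to transcribe directly. The following Python sketch implements the scalar (m = 1) Newton, Halley and Chebyshev steps exactly as written; the test equation x³ − 2 = 0 is an assumed example, not one used in this paper.

    def newton_step(f, fp, x):
        # x_{k+1} = x_k - f(x_k)/f'(x_k): quadratic convergence
        return x - f(x) / fp(x)

    def halley_step(f, fp, fpp, x):
        # x_{k+1} = x_k - f / (f' (1 - 0.5 f'^{-2} f'' f)): cubic convergence
        fx, fpx = f(x), fp(x)
        return x - fx / (fpx * (1.0 - 0.5 * fpp(x) * fx / fpx**2))

    def chebyshev_step(f, fp, fpp, x):
        # x_{k+1} = x_k - (1 + 0.5 f'^{-1} f'' f'^{-1} f) f'^{-1} f
        fx, fpx = f(x), fp(x)
        u = fx / fpx
        return x - (1.0 + 0.5 * fpp(x) / fpx * u) * u

    # Assumed example: f(x) = x^3 - 2, root 2^(1/3) ~ 1.2599.
    f, fp, fpp = lambda x: x**3 - 2, lambda x: 3 * x**2, lambda x: 6 * x
    x = 1.5
    for _ in range(4):
        x = halley_step(f, fp, fpp, x)
    print(x)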


These methods converge cubically to a solution x̂ of f(x) = 0.

In [1] a multivariate Halley method for m ≥ 1 is presented according to

    x_{k+1} = x_k + (p_k)² / (p_k + ½ b_k),    (2)

where

    p_k = −J_k^{-1} f_k,   b_k = J_k^{-1} [p_k^T ⊙ H(x_k)] p_k.

Since p_k and b_k are vectors in R^m, the squaring (p_k)² and the division in (2) are carried out element-wise.

The writer's own interpretation of how to implement Halley's method for the multivariate case is

    x_{k+1} = x_k + (I + ½ J_k^{-1} [p_k^T ⊙ H(x_k)])^{-1} p_k.    (3)

There are also methods that exhibit cubic convergence without using second order information. Typically these methods use some iterations ''inside'' the actual iteration, e.g.,

    x_{k+1} = x_k − f'(x_k)^{-1} (f(x_k) + f(x_k + p_k)),

from [6, page 315]. This iteration method does not rely solely on local information, since the function value f(x_k + p_k) is used. In [3], a method that uses f'(x_k + 0.5 p_k) in each step is presented. However, in this paper we leave these multi-step type methods aside and focus on methods that only use local information at the point x_k.

2 Secant update of J(x)

To derive an iteration method that uses second order information, one can use H to make a secant update of J according to the following. Given a Newton direction p_k, to make a refinement in each iteration, take

    f̃_k = f_k + J_k p_k + ½[p_k^T ⊙ H(x_k)]p_k = ½[p_k^T ⊙ H(x_k)]p_k    (4)

and

    J̃_k = J_k + [p_k^T ⊙ H(x_k)].    (5)

Now solve

    p̃_k = −J̃_k^{-1} f̃_k    (6)

and update

    x_{k+1} = x_k + p_k + p̃_k = Φ(x_k).    (7)

The iteration function Φ(x) can be written as

    Φ(x) = x + h(x),    (8)

    h(x) = −(I − (J(x) + [p(x)^T ⊙ H(x)])^{-1} ½[p(x)^T ⊙ H(x)]) J(x)^{-1} f(x),    (9)

but for computational purposes (7) is preferable due to its lower computational cost.
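A direct transcription of the update (4)-(7) into NumPy might look as follows. The two-equation demo system at the end is an assumed example with exact root (1, 2), not a problem from this paper.

    import numpy as np

    def m4_step(x, f, J, H):
        # One step of (7). H has shape (m, m, m); H[i] is the Hessian of f_i.
        fk, Jk = f(x), J(x)
        p = np.linalg.solve(Jk, -fk)          # Newton direction, as in (1)
        PH = np.einsum('j,ijk->ik', p, H)     # [p^T (.) H(x)], row i = p^T H_i
        f_t = 0.5 * PH @ p                    # (4), using f_k + J_k p_k = 0
        J_t = Jk + PH                         # (5)
        p_t = np.linalg.solve(J_t, -f_t)      # (6)
        return x + p + p_t                    # (7)

    # Assumed demo: f(x) = (x1^2 + x2 - 3, x1 + x2^2 - 5), exact root (1, 2).
    f = lambda x: np.array([x[0]**2 + x[1] - 3.0, x[0] + x[1]**2 - 5.0])
    J = lambda x: np.array([[2.0 * x[0], 1.0], [1.0, 2.0 * x[1]]])
    H = np.array([[[2.0, 0.0], [0.0, 0.0]],    # Hessian of f_1
                  [[0.0, 0.0], [0.0, 2.0]]])   # Hessian of f_2
    x = np.array([1.2, 1.8])
    for _ in range(4):
        x = m4_step(x, f, J, H)
    print(x)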


2.1 Convergence analysis

In this section we prove that (8) exhibits cubic convergence when m = 1. A proof for general m ≥ 1 has been done in [7], but it is much more tedious and therefore left out; it involves higher order differentiation of Φ(x) ∈ R^m, which requires different tensor products (and tensors) to be defined.

With m = 1, write (8) as

    Φ(x) = x − (1 − (f'(x) + p(x)f''(x))^{-1} ½ p(x)f''(x)) f'(x)^{-1} f(x),    (10)

where p(x) = −f'(x)^{-1} f(x).

Assumption 2.1  Let N be a neighborhood of a solution x̂, f(x̂) = 0. Assume that

    |f'(x) + p(x)f''(x)| > 0,    (11)

    |f'(x)| > 0,    (12)

for all x ∈ N, and that every derivative

    d^(j) f(x) / dx^(j),  j = 0, 1, 2, ...    (13)

is bounded for all x ∈ N.

Theorem 2.1  Under the conditions stated in Assumption 2.1, (10) converges cubically to a solution x̂ of f(x) = 0, i.e.,

    ||x_{k+1} − x̂|| ≤ γ ||x_k − x̂||³,

where γ ≥ 0.

Proof. To show cubic convergence we make use of a Taylor expansion of Φ(x) around x = x̂, i.e.,

    Φ(x_k) = Φ(x̂) + Φ'(x̂)s + (1/2!)Φ''(x̂)s² + R,

where s = x_k − x̂ and the remainder term R is

    R = (1/3!) ∫₀¹ (1 − t)² Φ'''(x̂ + ts) s³ dt.

The error at each iteration step, ||x_{k+1} − x̂||, can then be written as

    ||x_{k+1} − x̂|| = ||Φ(x_k) − Φ(x̂)|| = ||Φ(x̂) + Φ'(x̂)s + (1/2!)Φ''(x̂)s² + R − Φ(x̂)||.    (14)

If now Φ'(x̂) = 0 and Φ''(x̂) = 0, then Φ(x) exhibits cubic convergence, since (14) fulfills

    ||x_{k+1} − x̂|| = ||R|| ≤ (1/3!)||s||³ ∫₀¹ ||(1 − t)²|| · ||Φ'''(x̂ + ts)|| dt  ⇒


    ||x_{k+1} − x̂|| ≤ (1/3!)||x_k − x̂||³ ∫₀¹ ||Φ'''(x̂ + ts)|| dt ≤ γ ||x_k − x̂||³,    (15)

where

    γ = (1/3!) max_x ||Φ'''(x)||,  x ∈ [x̂, x̂ + s],

and Φ'''(x) is bounded due to the assumptions (11), (12) and (13). Proving that Φ'(x̂) = 0 and Φ''(x̂) = 0 can be done by direct differentiation. To simplify this rather tedious task the following form is used:

    Φ(x) = x + h(x) = x + A(x)p(x),

where

    A(x) = 1 − D^{-1}(x)W(x),  D(x) = f'(x) + p(x)f''(x),
    W(x) = ½ p(x)f''(x),  p(x) = −(f'(x))^{-1} f(x).

To make the expressions more readable we write, e.g., f := f(x), p := p(x), and so on. Then

    Φ'(x) = 1 + (A'p + Ap'),

where

    p' = −(f')^{-1}(f''p + f'),
    A' = −D^{-1}W' + D^{-2}WD',
    D' = f'' + pf''' + p'f'',
    W' = ½(pf''' + p'f'').

At x = x̂ we then have

    p(x̂) = 0,  p'(x̂) = −(f')^{-1}f' = −1,  W(x̂) = 0 ⇒ A = 1,

so

    Φ'(x̂) = 1 + (A'p + Ap') = 1 + (0 − 1) = 0.

Additionally, at x = x̂,

    D = f',  W' = −½ f''  ⇒  A'(x̂) = −D^{-1}W' = ½ (f')^{-1} f''.

The second order derivative of Φ(x) is

    Φ''(x) = A''p + A'p' + Ap'' + A'p',

where p''(x) = −(f')^{-1}(f'''p + 2f''p' + f''). At x = x̂ then

    Φ''(x̂) = p'' − 2A' = 0,

since

    p''(x̂) = −(f')^{-1}(0 − 2f'' + f'') = (f')^{-1}f'' = 2A'(x̂).

We have shown that Φ'(x̂) = 0 and Φ''(x̂) = 0, hence (15) holds, which is the condition for cubic convergence. ∎
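The cubic rate can also be checked numerically by estimating the order q from three consecutive errors: from e_{k+1} ≈ γ e_k^q one gets q ≈ log(e_{k+1}/e_k) / log(e_k/e_{k-1}). In this sketch the scalar test equation is an assumed example, and the estimate is only meaningful while the errors stay above rounding level.

    import math

    def phi(x, f, fp, fpp):
        # Scalar form (10) of the iteration (7).
        p = -f(x) / fp(x)
        f_t = 0.5 * p * fpp(x) * p          # (4) with m = 1
        J_t = fp(x) + p * fpp(x)            # (5)
        return x + p - f_t / J_t            # (6)-(7)

    f, fp, fpp = lambda x: x**3 - 2, lambda x: 3 * x**2, lambda x: 6 * x
    root = 2.0 ** (1.0 / 3.0)
    x = 1.5
    errs = [abs(x - root)]
    for _ in range(2):
        x = phi(x, f, fp, fpp)
        errs.append(abs(x - root))
    q = math.log(errs[2] / errs[1]) / math.log(errs[1] / errs[0])
    print(q)   # should come out close to 3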


3 Computational experiments

In this section the methods mentioned are tested on different problems. First some standard test functions are used, most of them from [5]. In the second part, quadratic functions are considered.

3.1 Tests on some standard test functions

Denote the four different methods by M1-M4, where

M1. Newton's method (1).
M2. Halley's method according to (2).
M3. Halley's method according to (3).
M4. The method (7).

The following six test functions are chosen. Here x̂ is the solution of f(x̂) = 0 and x̃ is a point used to generate initial values x_0 for the methods, as described below.

F1. Rosenbrock function:

    f(x) = [10(x_2 − x_1²) ; 1 − x_1],  x̂ = [1 ; 1],  x̃ = [−1.2 ; 1].

F2. Freudenstein and Roth function:

    f(x) = [−13 + x_1 + ((5 − x_2)x_2 − 2)x_2 ; −29 + x_1 + ((x_2 + 1)x_2 − 14)x_2],  x̂ = [5 ; 4],  x̃ = [0.5 ; −2].

F3. Powell singular function:

    f(x) = [x_1 + 10x_2 ; 5^{1/2}(x_3 − x_4) ; (x_2 − 2x_3)² ; 10^{1/2}(x_1 − x_4)²],  x̂ = [0 ; 0 ; 0 ; 0],  x̃ = [3 ; −1 ; 0 ; 1].

F4. Powell badly scaled function:

    f(x) = [10⁴ x_1 x_2 − 1 ; exp(−x_1) + exp(−x_2) − 1.0001],  x̂ = [1.098... · 10⁻⁵ ; 9.106...],  x̃ = [0 ; 1].

F5. Broyden tridiagonal function in R⁴:

    f_i(x) = (3 − 2x_i)x_i − x_{i−1} − 2x_{i+1} + 1,


where i = 1, ..., 4 and x_0 = x_5 = 0, with

    x̃ = [−1 ; −1 ; −1 ; −1],  x̄ = [−0.772... ; −0.837... ; −0.714... ; −0.441...].

F6. A version of the Rosenbrock banana function:

    f(x) = [−400xy + 400x³ − 2 + 2x ; 200y − 200x²],  x̂ = [1 ; 1],  x̃ = [−1.2 ; 1].

It is desired to have more data than just one test per function for each method. Hence the initial values x_0 are chosen in two different ways:

X1. x_0 = x̂ + δx, a starting point in the vicinity of the solution.
X2. x_0 = x̃ + δx, a starting point in the vicinity of x̃.

Here δx ∈ R^m is a randomly generated perturbation. The magnitude ||δx|| of δx is chosen bigger for X1 than for X2, because x̃ is already a perturbation of x̂.

For both cases X1 and X2, different perturbations δx are generated, yielding several initial values x_0. When comparing the number of steps each method needs to converge to x̂, it is only fair to consider the cases where an initial value x_0 yields convergence for all methods. Hence, if all methods converge for the same initial value x_0, we save the number of iterations for each method (unless a method performs very badly, in which case it is discarded). This is repeated until 100 different initial values have given convergence for all methods. The iteration for each method is terminated when ||f(x)|| < 10⁻¹⁴, after which it is checked whether x = x̂.

The following tables show results for each of the methods M1-M4 on the test functions F1-F6. The columns X1 and X2 show the average number of iterations needed for convergence to x̂ when choosing x_0 from X1 or X2, respectively. The columns marked Fails give the number of times the method did not converge to x̂ for initial values chosen according to X1 or X2.

F1     X1, ||δx|| ≤ 1   Fails X1   X2, ||δx|| ≤ 0.01   Fails X2
M1     2                0          2.02                0
M2     -                -          -                   -
M3     1                0          1.18                0
M4     1                0          1.06                0

M2 was discarded from this test since it resulted in division by zero at the solution, yielding a NaN¹ solution.

¹ The IEEE arithmetic representation for Not-a-Number.
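For reference, F1 has constant second order derivatives, so its data can be coded directly. The following sketch (an assumed illustration of the setup, including the X1 sampling) defines f, J and the tensor H for F1 and draws one starting point; iterating with a routine such as the m4_step sketch from Section 2 then gives counts of the same kind as reported below.

    import numpy as np

    # F1, Rosenbrock: f(x) = (10(x2 - x1^2), 1 - x1), solution xhat = (1, 1).
    f = lambda x: np.array([10.0 * (x[1] - x[0]**2), 1.0 - x[0]])
    J = lambda x: np.array([[-20.0 * x[0], 10.0], [-1.0, 0.0]])
    H = np.array([[[-20.0, 0.0], [0.0, 0.0]],    # Hessian of f_1 (constant)
                  [[0.0, 0.0], [0.0, 0.0]]])     # Hessian of f_2 (zero)

    # X1-style starting point: x0 = xhat + dx with a random ||dx|| <= 1.
    rng = np.random.default_rng(1)
    xhat = np.array([1.0, 1.0])
    dx = rng.standard_normal(2)
    dx *= rng.uniform(0.0, 1.0) / np.linalg.norm(dx)
    x0 = xhat + dx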


F2     X1, ||δx|| ≤ 1   Fails X1   X2, ||δx|| ≤ 0.01   Fails X2
M1     4.45             0          40.35               0
M2     2.91             226        25.7                132
M3     2.91             0          25.56               0
M4     2.89             0          5.72                0

M4 performed extremely well on the set X2. Again, M2 resulted in division by zero at the solution, hence the quite large number of fails.

F3     X1, ||δx|| ≤ 1   Fails X1   X2, ||δx|| ≤ 0.01   Fails X2
M1     22.7             0          26                  0
M2     14.99            0          16                  0
M3     14.51            0          16                  0
M4     11.53            0          13                  0

The test function F3 is a quadratic function, hence the tensor H(x) is constant for all x; additionally, H(x) is sparse. M4 needs about half the number of iterations of M1. As we shall see in the next subsection, using higher order information can then be preferable.

F4     X1, ||δx|| ≤ 1   Fails X1   X2, ||δx|| ≤ 0.01   Fails X2
M1     7.55             0          13                  0
M2     5.65             71         7                   85
M3     5.77             2          7                   0
M4     4.19             4          -                   -

M4 was discarded from the set X2 because it converged towards another solution. For this problem the Jacobian of f(x) became rank deficient on several occasions when using M4. M2 again suffered somewhat from NaN solutions.

F5     X1, ||δx|| ≤ 1   Fails X1   X2, ||δx|| ≤ 0.01   Fails X2
M1     4.77             0          5                   0
M2     3.35             0          3                   0
M3     3.14             0          3                   0
M4     2.65             0          3                   0

No big differences between the methods here. f(x) is again a quadratic function, and each Hessian H_i(x) ∈ R^{4×4} has only one non-zero element. It can be expected that if H(x) is very sparse, as in this case, then there is not much to gain from using second order information.

F6     X1, ||δx|| ≤ 1   Fails X1   X2, ||δx|| ≤ 0.01   Fails X2
M1     5.87             0          7.13                0
M2     -                -          -                   -
M3     17.07            27         38.19               130
M4     4.19             0          4.48                0

M2 was again discarded due to NaN solutions. M3 performed extremely badly on


both X1 and X2 for this test function.

3.1.1 Conclusions

The element-wise division used by M2 can give a division by zero close to the solution, when the Newton search direction p_k becomes very small. This could be prevented by terminating the iteration for small p_k; however, that might result in a less accurate computation of the solution. On all test problems M4 used the fewest iterations, though it proved not so stable on test F4. The very good performance of M2 on test F2 with the set X2 is suspected to be due to ''luck''; using more perturbed initial solutions results in smaller differences between the methods. Why M3 performed badly on F6 is not clear; using any other initial value (not as in X1 or X2) seemed to result in a very large number of iterations compared to the other methods.

3.2 Tests on quadratic functions

Here we consider the case when f(x) ∈ R^m is a set of quadratic equations of the form f_i(x) = a_i^T x + x^T G_i x + c_i, i = 1, ..., m. We express f(x) as

    f(x) = c + Ax + [x^T ⊙ G]x = 0,    (16)

where c ∈ R^m, A ∈ R^{m×m} and G_i ∈ R^{m×m}, i = 1, ..., m, i.e., G ∈ R^{m×m×m}. The Jacobian of f(x) is

    J(x) = A + [x^T ⊙ H],

where each Hessian matrix H_i ∈ R^{m×m} is H_i = G_i + G_i^T, i.e., not depending on x. This means that the computational cost for M2-M4 decreases in each iterate. To see whether these methods give an increase in efficiency, we need to study the number of FLOPS used.

Let C(O) denote the computational cost of an operation O. All methods need to compute [x^T ⊙ G] and [x^T ⊙ H] in each iterate. The costs are

    C([x^T ⊙ G]) = C([x^T ⊙ H]) = m²(2m − 1).

Having computed these, the costs for evaluating f(x) and J(x) are

    C(f) = C(Ax) + m(2m − 1) + 2m = m(2m − 1) + m(2m − 1) + 2m,
    C(J) = m².

The cost of solving the equation system J_k p_k = −f_k is

    (4m³ − 3m² − m)/6 + 2m² − m,    (17)

so for Newton's method M1 the cost of each iterate is

    C(M1) = 2m²(2m − 1) + C(f) + C(J) + ((4m³ − 3m² − m)/6 + 2m² − m) + m =


    = 14/3 m³ + 9/2 m² − 1/6 m.

After inspection of M3 it is seen that for m > 1, M4 uses fewer FLOPS per iterate; hence we only consider M4. The additional computations for M4 are f̃, J̃ and the solution of J̃p̃ = −f̃. The costs for these computations are

    C([p^T ⊙ H]) = m²(2m − 1),  C(f̃) = m(2m − 1) + m,  C(J̃) = m²,

and the cost of solving J̃p̃ = −f̃ is as in (17). The number of FLOPS in each iterate for M4 is then

    C(M4) = C(M1) + m²(2m − 1) + C(f̃) + C(J̃) + ((4m³ − 3m² − m)/6 + 2m² − m) + 2m
          = 22/3 m³ + 8m² + 2/3 m.

Assume now that M1 and M4 converge in k_1 and k_4 steps respectively. To benefit from using M4, it is desired that k_4 C(M4) < k_1 C(M1), or equivalently

    k_4 / k_1 < C(M1) / C(M4) = κ(m),    (18)

where

    κ(m) = (28m − 1) / (4(11m + 1)).

A plot of κ(m) is shown in Figure 1. The minimum value of κ(m) is attained at m = 1, κ(m) increases with m, and

    lim_{m→∞} κ(m) = 7/11.

When testing the condition (18) for different dimensions m, the following setup is used (in MATLAB notation):

1. Generate a solution x̂ = randn(m, 1) and random matrices A = randn(m, m), G_i = randn(m, m), i = 1, ..., m.
2. Set c = −(Ax̂ + [x̂^T ⊙ G]x̂).

The initial values x_0 used for M1 and M4 are generated as x_0 = x̂ + δx, where ||δx|| ≤ 1/m is a randomly generated perturbation. As m increases, so does the number of solutions of (16); if ''large'' perturbations δx are used, convergence towards other solutions occurs. Hence ||δx|| ≤ 1/m is used, since it is local convergence we wish to examine. Now, for each m = 1, 2, ..., 20, ten different problems are generated according to steps 1-2 above. For each of these problems, initial values x_0 are generated until both methods have converged 100 times (using the same x_0). This results in 1000 tests in total for a given m. Figure 2 shows the average numbers of steps k_1 and k_4 used by each method for a given m. Figure 1 illustrates the condition (18); this condition is not always fulfilled, and Figure 3 shows the average percentage of failures of (18).
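The quadratic case (16) is easy to generate and evaluate. The sketch below is an assumed illustration following steps 1-2 above, with NumPy's einsum standing in for the ⊙ products; it also encodes the bound κ(m) from (18).

    import numpy as np

    def make_problem(m, rng):
        # Steps 1-2: random quadratic system with known solution xhat.
        xhat = rng.standard_normal(m)
        A = rng.standard_normal((m, m))
        G = rng.standard_normal((m, m, m))       # G[i] in R^{m x m}
        c = -(A @ xhat + np.einsum('j,ijk,k->i', xhat, G, xhat))
        return c, A, G, xhat

    def f_quad(x, c, A, G):
        # f(x) = c + A x + [x^T (.) G] x, as in (16)
        return c + A @ x + np.einsum('j,ijk,k->i', x, G, x)

    def J_quad(x, A, H):
        # J(x) = A + [x^T (.) H], with H_i = G_i + G_i^T constant in x
        return A + np.einsum('j,ijk->ik', x, H)

    kappa = lambda m: (28 * m - 1) / (4.0 * (11 * m + 1))   # bound in (18)

    rng = np.random.default_rng(2)
    c, A, G, xhat = make_problem(5, rng)
    H = G + np.transpose(G, (0, 2, 1))
    print(np.linalg.norm(f_quad(xhat, c, A, G)))   # ~0 at the solution
    print(kappa(5))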


Figure 1: The function κ(m), where κ(1) = 9/16 = 0.5625, and the quotient k_4/k_1 for different tests of dimension m (marked o). Axes: m against κ(m) and k_4/k_1.

Figure 2: The average number of iterations k_1 (marked o) and k_4 (marked x) needed by M1 and M4, respectively, to solve a problem of dimension m.


Figure 3: The percentage of failures of (18) for a given m.

3.2.1 Conclusions

The tests indicate that a cubically convergent method can give increased performance when solving quadratic functions, here for smaller problems with m < 15. The condition (18) seldom failed to hold for 1 < m < 14. If the second order tensor H is also sparse, then each iteration step of M4 is even cheaper to compute, and an additional gain could arise. But it should be clear that the tests here are purely randomly generated, and results can be expected to differ a lot depending on the type of quadratic problem considered.

References

[1] A. A. M. Cuyt and L. B. Rall. Computational Implementation of the Multivariate Halley Method for Solving Nonlinear Systems of Equations. ACM Transactions on Mathematical Software, 11(1):20-36, 1985.

[2] J. E. Dennis and R. B. Schnabel. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice-Hall, 1983.

[3] H. H. H. Homeier. A Modified Newton Method with Cubic Convergence: The Multivariate Case. J. Comput. Appl. Math., 169(1):161-169, 2004.

[4] M. A. Hernández and J. M. Gutiérrez. New Recurrence Relations for Chebyshev Method. Appl. Math. Lett., 10(2):63-65, 1997.


[5] J. J. Moré, B. S. Garbow, and K. E. Hillstrom. Testing Unconstrained Optimization Software. ACM Transactions on Mathematical Software, 7(1):17-41, 1981.

[6] J. M. Ortega and W. C. Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, 1970.

[7] T. Viklands. A Note on Representations of Derivative Tensors of Vector Valued Functions. Technical Report UMINF-05.14, Department of Computing Science, Umeå University, Umeå, Sweden, 2005.


Paper VI

Optimization Tools for Solving Nonlinear Ill-posed Problems*

Thomas Viklands and Mårten Gulliksson
Department of Computing Science, Umeå University
SE-901 87 Umeå, Sweden.
viklands@cs.umu.se, marten@cs.umu.se

Abstract

Using the L- and a-curve, we consider how a nonlinear ill-posed Tikhonov-regularized problem can be solved with a Gauss-Newton method. The solution to the problem is chosen from the point on the logarithmic L-curve that has maximum curvature, i.e., the corner. The L-curve is used for analyzing how the tradeoff between minimizing the solution norm versus the residual norm changes with the choice of regularization parameter. The a-curve is used to analyze how the objective function changes for different choices of the regularization parameter. In the numerical tests we solve an inverse problem in heat transfer, where the L-curve solution is compared to the optimal solution and the solution given by the discrepancy principle.

Keywords: Tikhonov regularization, L-curve, ill-posed, heat equation, corner solution.

* Fast solution of discretized optimization problems (Berlin, 2000), 255-264, Internat. Ser. Numer. Math., 138, Birkhäuser, Basel, 2001. Published here with permission from Birkhäuser Verlag, Basel, Switzerland.


Contents

1 Introduction 161
2 The Nonlinear L-curve 161
  2.1 The Shadow L-curve 162
  2.2 The a-curve 163
3 Definitions of the corner 164
  3.1 Formulation of Reginska 164
  3.2 Definition by curvature 164
4 Estimating the corner using Reginska formulation 165
5 Approaching the corner 165
  5.1 A simple way of choosing λ 166
  5.2 Updating λ using the linear L-curve 167
  5.3 Using the a-curve 167
6 Local convergence towards the corner 168
7 The Algorithm in total 169
8 Numerical Simulations 170
  8.1 Example 1 171
  8.2 Example 2 171
  8.3 Example 3 172
  8.4 Example 4 172
9 Conclusions 172
References 173


1 Introduction

Consider a problem of the form

    min_x ½‖f(x)‖²_W + ½λ‖L(x − x_c)‖²_V,    (1)

where W and V are spaces such that f : V → W, x_c is the center of regularization, λ is the regularization parameter, and L is a matrix describing a transformation of x.

Suppose that no information about the noise in f is known, such as the noise level or the standard deviation. Then the solution to (1) cannot be calculated in the sense of the well known discrepancy principle, or by any other method that results in a convergent method [12], [2]; that is, a method for which, when the noise δ → 0, the regularization parameter λ given by the method also tends to zero, resulting in the non-regularized solution. A tool that can be used for analyzing ill-posed problems is the controversial L-curve, which has proven successful in some application areas dealing with linear problems [1]. However, it has been shown that using the L-curve to choose the regularization parameter results in an inaccurate method [3], [8], [9]. Our intention is to investigate the use of the L-curve from an engineer's point of view.

2 The Nonlinear L-curve

The L-curve for nonlinear problems is defined as the curve (t(x), y(x)), where

    t(x) = ½||f(x)||²,  y(x) = ½||L(x − x_c)||²,

and x = x(λ) is the solution to (1). It can be shown that the L-curve y(t), y : R → R, has the basic local properties

    dy/dt = −1/λ < 0,  d²y/dt² > 0,    (2)

which define y as a monotonically decreasing, strictly convex function of t [4].

The L-curve in a logarithmic scale, (log(t), log(y)), is expected to have the shape of the letter ''L'' if the problem is ill-posed and contains noise. The shape is a result of a radical growth of the solution norm y as the regularization parameter λ gets small, which is common for ill-posed problems. A reasonable solution should lie in the vicinity of the ''corner'', where y is about to start growing while t remains almost fixed.
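For a linear residual f(x) = Ax − b and L = I, (1) has the closed-form minimizer x(λ) = (AᵀA + λI)⁻¹(Aᵀb + λx_c), so the L-curve can be traced exactly. The following assumed linear example (random data, not from this paper) illustrates the definitions of t and y:

    import numpy as np

    def l_curve_point(lam, A, b, xc):
        # Closed-form Tikhonov minimizer of (1) with f(x) = Ax - b, L = I.
        n = A.shape[1]
        x = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b + lam * xc)
        t = 0.5 * np.linalg.norm(A @ x - b)**2
        y = 0.5 * np.linalg.norm(x - xc)**2
        return t, y

    rng = np.random.default_rng(3)
    A = rng.standard_normal((20, 5))
    b = rng.standard_normal(20)
    xc = np.zeros(5)
    for lam in (1e-4, 1e-2, 1.0):
        print(lam, l_curve_point(lam, A, b, xc))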


Figure 1: Two L-curves in a logarithmic scale (axes log t and log y; λ decreases along the curves): one without noise in f(x) and one from a problem where f contains noise.

For linear problems the construction of the L-curve is fairly straightforward [1]: we solve the linear problem for a wide range of regularization parameters λ and locate the corner solution. Doing this for nonlinear problems can be very time consuming, since attaining every point (t, y) requires the solution of a nonlinear problem. Instead of computing the ''exact'' L-curve, we compute an approximation to it.

2.1 The Shadow L-curve

To attain an approximation of the L-curve for (1), we can use an algorithm similar to the following:

1. While no solution found
   1.1. Choose regularization parameter λ_i
   1.2. Compute a search direction p(λ_i)
   1.3. Check for descent: f(x_i + p) < f(x_i)
   1.4. Take x_{i+1} = x_i + p
   1.5. t_i = ½||f(x_{i+1})||², y_i = ½||L(x_{i+1} − x_c)||²
   1.6. i = i + 1

By gathering the points {t_i = t(x(λ_i)), y_i = y(x(λ_i))} given during the iteration, we may pick out a subset M from the set {t_i, y_i} defining a monotonically decreasing convex function. These points will always lie on or above the exact L-curve,


hence approximating it. The set M defines another set of linear functions, from point (t_i, y_i) to (t_{i+1}, y_{i+1}), which is called the polygon shadow L-curve, y_sp. Since it lacks smoothness, let y_sm(t) be a convex, monotonically decreasing spline function interpolating the set M. Then y_sm(t) is the smooth shadow L-curve, which contains higher order information such as first and second order derivatives. For further details, see [10].

Figure 2: The shadow L-curve approximating the true nonlinear L-curve: a convex polygon through the points (t_i, y_i) in the (t(x), y(x)) plane.

Iterating with a fixed λ and getting closer to the exact L-curve is not meaningful, unless it is assumed that keeping λ fixed would give the corner solution, or unless a better approximation of the L-curve is needed. Our method is to choose the regularization parameters so that the algorithm converges towards the corner. Close to the corner we compute a more precise approximation of the L-curve.

2.2 The a-curve

The a-curve is defined as the curve (λ, a(λ)), where

    a(λ) = min_x t(x) + λy(x),

which is just the objective function itself. Clearly it describes how the optimization problem (1) behaves when the regularization parameter changes. Having a good approximation of the a-curve gives information on how the optimization problem depends on the regularization parameter. However, the a-curve contains the same information as the L-curve; it is just represented differently.
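As a minimal sketch, the gathering loop 1.1-1.6 of Section 2.1, one possible realization of the convex subset M (a standard lower-convex-hull pass), and the shadow a-curve evaluation implied by the definition above (a(λ) ≤ t_i + λy_i for every gathered point) could look as follows. The step routine and problem callables are left abstract, and the sample points are made up; this is an assumed illustration, not the paper's implementation.

    import numpy as np

    def gather_points(x0, lambdas, f, step, L, xc):
        # Steps 1.1-1.6: iterate and record (t_i, y_i).
        x, pts = x0, []
        for lam in lambdas:                                        # 1.1
            p = step(x, lam)                                       # 1.2
            if np.linalg.norm(f(x + p)) < np.linalg.norm(f(x)):    # 1.3
                x = x + p                                          # 1.4
            t = 0.5 * np.linalg.norm(f(x))**2                      # 1.5
            y = 0.5 * np.linalg.norm(L @ (x - xc))**2
            pts.append((t, y))
        return pts

    def shadow_polygon(pts):
        # One realization of M: lower convex hull in the (t, y) plane.
        pts = sorted(set(pts))
        hull = []
        for pt in pts:
            while len(hull) >= 2:
                (t0, y0), (t1, y1) = hull[-2], hull[-1]
                # Drop hull[-1] if it lies on or above the segment to pt.
                if (t1 - t0) * (pt[1] - y0) - (y1 - y0) * (pt[0] - t0) <= 0:
                    hull.pop()
                else:
                    break
            hull.append(pt)
        return hull

    def shadow_a(lam, pts):
        # Lower envelope of the lines t_i + lam * y_i, one per gathered point.
        return min(t + lam * y for (t, y) in pts)

    samples = [(0.9, 0.1), (0.4, 0.8), (0.2, 2.5)]   # made-up (t_i, y_i) data
    print(shadow_polygon(samples), shadow_a(0.2, samples))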


The a-curve has the local properties

    da/dλ = y > 0,  d²a/dλ² < 0,

defining it as a strictly concave, strictly increasing function of λ [4].

As in the case of the L-curve, there exist piecewise linear functions a_sp(λ) = t_i + λy_i forming the polygon shadow a-curve [5]. A smooth shadow a-curve can easily be constructed from the smooth shadow L-curve.

3 Definitions of the corner

As stated before, the L-curve plotted in a logarithmic scale is expected to have a corner if the ill-posed problem contains noise. The corner may be defined as the point where the logarithmic L-curve has its maximum curvature [1], [7]. The corner describes the point where the tradeoff between minimizing the solution norm and the residual norm is somehow balanced. This does not mean that the solution x(λ) given at the corner is optimal, but it is in any case a reasonable solution (the engineering point of view).

3.1 Formulation of Reginska

The corner with maximum curvature can be found by solving

    min_λ t(x(λ)) y(x(λ))^α,  α > 0,    (3)

for linear functions f(x) = Ax − b [7]; this is generalized to the nonlinear case in [4]. This function can be quite nonlinear and experience heavy oscillations. We consider instead the minimization problem

    min_λ {log t + log y},    (4)

which has the same minimum as (3) under the assumption α = 1. (4) corresponds to the minimum of the logarithmic L-curve rotated π/4 radians [4].

3.2 Definition by curvature

Another way of computing the corner is to use the formula for the curvature of the logarithmic L-curve [1]. With

    τ = log t,  η = log y,

the curvature of y(t) in a logarithmic scale becomes

    κ = (d²η/dτ²) / (1 + (dη/dτ)²)^{3/2}.


Thus finding the corner corresponds to solving the maximization problem

    max_λ κ,

which has to be solved using the smooth approximation y_sm(t) of the L-curve.

4 Estimating the corner using the Reginska formulation

Consider the set of points τ_i = log t_i, η_i = log y_i given during the iteration when solving (1).

Figure 3: Rotation of three points in the vicinity of the corner, in the (log t, log y) plane. The approximated minimum is marked ×.

When locating the corner, every set of three points

    {(τ_{i−1}, η_{i−1}), (τ_i, η_i), (τ_{i+1}, η_{i+1})}

is rotated π/4 radians. Then, if the midpoint (τ_i, η_i) is a minimum, the corner is approximated by the minimum of a quadratic spline interpolating these three points.

5 Approaching the corner

The aim is to locate an interval of the L-curve in which a corner, defined according to the Reginska formulation, exists. Later on we shall show that once this interval is found, convergence towards the corner is not problematic. Using the Gauss-Newton method, the linearized problem

    min_p ½‖f(x_i) + J(x_i)p‖² + ½λ_i‖x_i + p − x_c‖²,    (5)


where

    J(x_i) = ∂f/∂x(x_i),

is solved, giving the search direction p = p(λ). The regularization parameter must be chosen properly so that a ''safe'' step is taken. We regard the search direction as ''safe'' if x_k and x_k + p_k are close to the same nonlinear L-curve. Choosing a small λ-value will give a very large step length, which may lead to a different trajectory. Consequently, it is recommended to begin with a rather big λ and gradually decrease λ during the iteration, if the center of regularization is considered well chosen. Since a problem may consist of more than one trajectory, resulting in many different L-curves, we define the global L-curve as the convex set of all these different L-curves.

A general iteration may look as follows:

1. Calculate λ_i.
2. Compute the direction p(λ_i).
3. x_{i+1} = x_i + αp_i is the new approximation to the solution of (1).
4. Determine if x_{i+1} will belong to the convex hull of the polygon shadow L-curve.
5. If the convex set approximates a corner, steer the solutions towards it.

5.1 A simple way of choosing λ

If a Gauss-Newton method is used, the linear problem

    (J^T J + λL^T L)p = J^T f

is to be solved in each iteration. Hence it is reasonable to choose λ_0 initially in the order of the largest singular values of J^T J, and then gradually decrease λ. The nonlinearity and ill-posedness, combined with a problem consisting of many local minima, make the choice of λ crucial in order to get convergence towards the corner solution.

An easy way to update λ is to divide it by a constant greater than one in each iteration,

    λ_{i+1} = λ_i / k,  k > 1.

Though a large value of the constant k might lead to losing convergence, this method can be effective if the problem is large and not too sensitive to the choice of the regularization parameter.
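One concrete reading of this section is a regularized Gauss-Newton loop in which λ is divided by k after every step. The sketch below writes out the normal equations of the linearized subproblem (5) with explicit signs, and includes the matrix L of (1); treating this as the implementation is an assumption of the sketch, not the paper's exact code.

    import numpy as np

    def gn_step(x, lam, f, J, L, xc):
        # Normal equations of (5):
        #   minimize 0.5||f(x) + J(x) p||^2 + 0.5*lam*||L(x + p - xc)||^2.
        Jx, fx = J(x), f(x)
        M = Jx.T @ Jx + lam * (L.T @ L)
        b = -(Jx.T @ fx + lam * (L.T @ (L @ (x - xc))))
        return np.linalg.solve(M, b)

    def iterate(x0, lam0, k, nsteps, f, J, L, xc):
        # Section 5.1 schedule: start with a rather big lambda, lam_{i+1} = lam_i / k.
        x, lam, pts = x0, lam0, []
        for _ in range(nsteps):
            x = x + gn_step(x, lam, f, J, L, xc)
            t = 0.5 * np.linalg.norm(f(x))**2
            y = 0.5 * np.linalg.norm(L @ (x - xc))**2
            pts.append((lam, t, y))          # points for the shadow L-curve
            lam /= k
        return x, pts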


5.2 Updating λ using the linear L-curve

Standing at x_i with Jacobian J_i, compute the step p(λ_i) for a large set of regularization parameters from (5). Then use the linear L-curve

    y_lin = ½||x_i + p(λ_i) − x_c||²,  t_lin = ½||f(x_i) + J_i p_i||²,

defined by λ_i, and compare with the actual residual reduction

    y = ½||x_i + p(λ) − x_c||²,  t = ½||f(x_i + p_i)||².

Figure 4: Two panels in the (log t, log y) plane: the true nonlinear L-curve together with the linear L-curve ||f(x_i) + J_i p_i(λ)||² and the actual residual ||f(x_i + p_i(λ))||², and the steps x_{i+1} = x_i + p(λ_i), x_{i+2} = x_{i+1} + p(λ_{i+1}). The step should be taken into the area where the linear L-curve and the actual reduction start to differ.

Depending on the nonlinearity of the problem, the differences between the L-curves for the linearized and the nonlinear problem may vary. The safest way to choose the regularization parameter is to pick a λ where these two L-curves start to differ. Clearly, if the L-curves do not differ greatly, a λ corresponding to the corner of the linear L-curve should be chosen. The cost of calculating the linear L-curve decides whether this method is preferred.

5.3 Using the a-curve

Since the a-curve is the objective function as a function of λ, analyzing it could yield a more robust and mathematically correct method for updating the regularization parameter. Our idea is to use a shooting algorithm, but such methods are often very inaccurate. In this case there is no information about the function a(λ), except that it is an increasing, concave function. Another difficulty is that the points approximating the a-curve are distributed rather ''logarithmically''. The


a-curve can have a very drastic change for small λ values, so it seems a good idea to work in a logarithmic scale, which additionally eliminates the possibility of negative values of the regularization parameter.

Investigation of the curvature of a(λ),

    κ = (d²a/dλ²) / (1 + y²)^{3/2},    (6)

could maybe yield some information making it possible to perform a more accurate shooting algorithm.

Figure 5: The a-curve in a log scale (log λ against log a(λ)), showing the approximated and the true a-curve, a decrease δa, and the parameters log λ_{i+1} and log λ_i. The dotted curve shows how it is possible to get onto another trajectory when choosing small regularization parameters.

However, it all boils down to this: standing at (λ_i, a_i), the regularization parameter λ_{i+1} must be estimated on the basis of decreasing a(λ) by some amount δa. An effective and safe way of doing this is yet to be discovered.

6 Local convergence towards the corner

Assume that we have attained three points {(t_{i+1}, y_{i+1}), (t_i, y_i), (t_{i−1}, y_{i−1})}, all of them assumed to be close to the exact L-curve and approximating a corner as defined in Section 4. Obviously, if these points lie on the exact L-curve, the inequality

    λ_{i+1} ≤ λ_corner ≤ λ_{i−1}    (7)

is satisfied, and it is reasonable to assume that (7) holds also for points in the vicinity of the L-curve. The general idea of the algorithm is to shrink the


6 Local convergence towards the corner

Assume that we have attained three points {(t_{i+1}, y_{i+1}), (t_i, y_i), (t_{i−1}, y_{i−1})}. All of them are assumed to lie close to the exact L-curve and to approximate a corner as defined in Section 4.1. Obviously, if these points lie on the exact L-curve, the inequality

    λ_{i+1} ≤ λ_corner ≤ λ_{i−1}    (7)

is satisfied, and it is reasonable to assume that (7) also holds for points in the vicinity of the L-curve. The general idea of the algorithm is to shrink the distances between the three points and steer them closer to the corner that they approximate.

[Figure 6: The level curves t(x) = (1/2)||f(x)||^2 around the corner solution x_Lc, together with the points x_{i−1}, x_i, x_{i+1}, the new points x_{i−1,i}, x_{i,i+1}, x_{i−1,i+1}, and the minimizer of (1/2)||f(x)||^2.]

Consider now the solutions x(λ) at each point and define the new solutions

    x_{i+1,i} = (x_{i+1} + x_i)/2,   x_{i,i−1} = (x_i + x_{i−1})/2,   x_{i+1,i−1} = (x_{i+1} + x_{i−1})/2,

and regularization parameters

    λ_{i+1,i} = 10^{(log λ_{i+1} + log λ_i)/2},   λ_{i,i−1} = 10^{(log λ_i + log λ_{i−1})/2},   λ_{i+1,i−1} = 10^{(log λ_{i+1} + log λ_{i−1})/2}.

From each point in {x_{i+1,i}, x_{i,i−1}, x_{i+1,i−1}}, calculate a new solution using the corresponding regularization parameter from {λ_{i+1,i}, λ_{i,i−1}, λ_{i+1,i−1}}. Since the L-curve is locally convex, these new solutions give rise to three new points on the L-curve that approximate the corner better.
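A direct transcription of this refinement step: midpoints of the stored solutions and geometric means of the regularization parameters (arithmetic means in log10). Only the bookkeeping is shown; re-iterating each new point back onto the L-curve with its λ held fixed, as in the algorithm of the next section, is left to the caller.

```python
import numpy as np

def refine_corner(x_prev, x_mid, x_next, lam_prev, lam_mid, lam_next):
    """One refinement step of Section 6. Given the solutions and
    regularization parameters of three points bracketing the corner
    (indices i-1, i, i+1), return the three new (x, lambda) pairs."""
    def gmean(a, b):
        # arithmetic mean in log10, i.e. the geometric mean
        return 10.0 ** (0.5 * (np.log10(a) + np.log10(b)))

    return [
        (0.5 * (x_next + x_mid),  gmean(lam_next, lam_mid)),    # x_{i+1,i}
        (0.5 * (x_mid + x_prev),  gmean(lam_mid, lam_prev)),    # x_{i,i-1}
        (0.5 * (x_next + x_prev), gmean(lam_next, lam_prev)),   # x_{i+1,i-1}
    ]
```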


7 The Algorithm in total

The algorithm constructed is not to be regarded as a black box, since modifications may be needed depending on the optimization problem. The choice of the regularization parameter in each iteration may be very tedious and degenerate into a trial-and-error technique. Furthermore, the algorithm can handle situations where the approximated corner is not well defined, which may happen if the shadow L-curve is a poor approximation of the exact L-curve.

1. While no corner solution is found:
2.   Compute the Jacobian J_i.
3.   Choose a regularization parameter λ_i.
4.   While (t_k, y_k) does not belong to the convex set M:
     (a) Compute the Jacobian J_k.
     (b) Iterate with x_{k+1} = x_k + α p(λ_i), where λ_i is fixed.
     (c) Set t_k = ||x_{k+1} − x_c||^2, y_k = ||f(x_{k+1})||^2.
5. If there exist three points in M approximating a corner:
     (a) While an approximating corner exists:
          i. Calculate x_{i+1,i}, x_{i,i−1}, x_{i+1,i−1} and the new regularization parameters.
          ii. While the three points do not belong to the convex set M:
               A. Compute each Jacobian J_{i+1,i}, J_{i,i−1}, J_{i+1,i−1}.
               B. Iterate for all three points using fixed regularization parameters.
          iii. If x_{i+1,i} ≈ x_{i,i−1} ≈ x_{i+1,i−1}, return x_{i,i−1} as the corner solution;
          iv. else update M with the new points.
     (b) If the corner is lost, continue from step 1 with the corner search.

8 Numerical Simulations

The inverse problem is to determine the conductivity σ(x) from the heat transfer equation

    −d/dx (σ(x) du/dx) = f(x),   0 < x < 1,    (8)
    u(0) = u_0,   u(1) = u_1,

where f ∈ L_2 and |u_x| > 0. The measured quantity is denoted ũ(x), and the problem can then be formulated as

    F(σ) ≈ ũ,

where the nonlinear operator F : H_1 → L_2 is Fréchet-differentiable with a Lipschitz-continuous derivative. The minimization problem is stated as

    min_σ (1/2) ||F(σ) − ũ||^2_{L_2} + (1/2) λ ||σ − σ_c||^2_{H_1}.    (9)
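Anticipating the discretization described next, a single regularized search direction for (9) can be sketched as follows. The right-hand side b is not spelled out in the paper; the gradient-based choice used here, b = −(J^T W_n r + λ W_pm (θ − θ_c)), is an assumption consistent with minimizing (9), and W_n, W_pm are taken to be symmetric positive definite inner-product matrices of the L_2 and H_1 spaces.

```python
import numpy as np

def regularized_step(J, r, theta, theta_c, lam, W_n, W_pm):
    """Search direction p(lambda) from the discrete version of (9):
    (J^T W_n J + lam * W_pm) p = b. Here b is assumed to be the
    negative gradient of the discretized objective,
    b = -(J^T W_n r + lam * W_pm (theta - theta_c)),
    with r = F(theta) - u_tilde the current residual."""
    lhs = J.T @ W_n @ J + lam * W_pm
    rhs = -(J.T @ W_n @ r + lam * W_pm @ (theta - theta_c))
    return np.linalg.solve(lhs, rhs)
```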


In order to calculate F(σ), equation (8) is solved using a finite element representation with linear spline approximations according to

    u_fe(x) = Σ_{j=1}^{n} β_j φ_j(x),   σ(x) ≈ Σ_{s=1}^{m} θ_s φ_s(x).

After discretization, the search direction p(λ) is found by solving the discrete version of (9),

    (J^T W_n J + λ W_pm) p = b,

where the matrix J is the Jacobian, b is a vector, and W_n and W_pm are inner-product matrices.

In the examples, δ is the noise level and λ* is the optimal regularization parameter, i.e., the one that minimizes ||σ(λ) − σ*|| / ||σ(λ)||, where σ* is the exact solution. λ_c and λ_d are the regularization parameters given by the L-curve and the discrepancy principle, respectively.

8.1 Example 1

For this example, which is the most ill-posed, the initial value

    σ_0(x) = 2 + 1.428x^5 − 4.382x^4 + 1.04x^3 + 3.63x^2

was used. The exact solution is assumed to be σ*(x) = 1 and the exact data u(x) = e^x.

The regularization parameter corresponding to the corner of the L-curve is for this example too small compared to the optimal λ*. As the noise increases, the estimated regularization parameter λ_c differs even more from the optimal λ*. The regularization parameter given by the discrepancy principle is in this case a very good approximation of the optimal one.

    δ(%)     λ*           λ_c          λ_d
    0.6      2.0·10^-3    6.0·10^-4    2.0·10^-3
    0.05     8.2·10^-4    1.7·10^-4    1.1·10^-3
    0.003    2.5·10^-4    8.1·10^-6    3.3·10^-4

8.2 Example 2

Here the initial value used was

    σ_0(x) = 1 + (1/(10 sinh(1))) (9 − 4x + 4x^2 − 4(cosh(x) − cosh(x − 1))),

with the assumptions that σ*(x) = 1 and u(s) = s(1 − s). For this example the corner solution lies very close to the optimal one. The solution given by the discrepancy principle is too large and not as good as the corner solution.


    δ(%)     λ*           λ_c          λ_d
    4        1.1·10^-3    2.1·10^-3    2.7·10^-3
    0.5      7.4·10^-5    2.2·10^-4    1.1·10^-3
    0.05     1.7·10^-5    1.6·10^-5    1.4·10^-4
    0.005    6.7·10^-6    1.6·10^-6    4.1·10^-5

8.3 Example 3

In this example the initial guess was σ_0 = 2, the data u(s) = s(1 − s), and the exact solution was assumed to be σ*(s) = 1 + 0.1 sin(2πs). The results are about the same as for the previous example: the corner solution is again quite close to the optimal solution, while the discrepancy principle results in a somewhat too large regularization parameter.

    δ(%)     λ*           λ_c          λ_d
    5        6.1·10^-4    2.1·10^-3    1.5·10^-3
    0.5      1.0·10^-4    1.6·10^-4    6.1·10^-4
    0.05     6.7·10^-6    1.3·10^-5    2.5·10^-4
    0.005    1.4·10^-7    1.4·10^-6    3.3·10^-4

8.4 Example 4

Here the exact solution has a discontinuity: it is assumed to be 1 on the interval 0 ≤ x ≤ 0.5 and 2 on 0.5 < x ≤ 1. Further, the data was u(s) = s(1 − s) and the initial guess used was σ_0 = 2.

This problem is not that ill-posed; regularization was only needed at the beginning of the iteration when the noise level was small. However, in this case λ_c and λ_d differ quite a lot from the optimal λ*.

    δ(%)     λ*           λ_c          λ_d
    0.5      6.7·10^-6    1.3·10^-3    2.5·10^-4
    0.05     5.0·10^-9    1.4·10^-5    4.1·10^-5
    0.005    1.0·10^-10   1.0·10^-6    1.6·10^-5

9 Conclusions

We have come to the conclusion that there do exist cases where the "heuristic" L-curve method is capable of approximating a reasonable solution to a nonlinear ill-posed problem. In theory, the solution x_Lc given by the corner of the L-curve might even be a worse approximation of the exact solution x* than the center of regularization x_c, since the method is nonconvergent. However, when dealing with non-artificial problems, the solution x(λ_c) should be regarded as the solution in the vicinity of x_c that results in a small residual, not as an optimal or best possible solution.






Notations to Paper I-IV

Item         Description
A            m by m diagonal matrix with elements A = diag(α_1, ..., α_m) and α_i ≥ α_{i+1} > 0.
X            n by n diagonal matrix with elements X = diag(χ_1, ..., χ_n) and χ_i ≥ χ_{i+1} > 0.
D            In Paper V, D = A^T A (diagonal matrix).
Q            Matrix Q ∈ R^{m×n}, where n ≤ m, whose columns are orthonormal, i.e., Q^T Q = I_n.
q_i          The ith column vector of Q.
Q̃            Mostly used when addressing some orthonormal matrix.
Q̂, Q̂_i       A solution to a WOPP, or the ith solution when speaking of several minima.
G, G_i       A symmetric solution to the CARE.
Q_⊥          Orthonormal basis for the null space of Q^T.
M            M ∈ R^{3×3} orthogonal matrix with det(M) = 1.
I_m          m by m identity matrix.
I_{m,n}      I_{m,n} = diag(1, ..., 1) ∈ R^{m×n}.
V_{m,n}      Stiefel manifold, the set of all matrices Q ∈ R^{m×n} with orthonormal columns.
vec(B)       The vec-operator stacks the column vectors of the matrix B ∈ R^{m×n} into a column vector of dimension mn.
⊗            Kronecker product.
f(Q)         A linear function of Q ∈ R^{m×n}, written as f(Q) = F vec(Q).
F            Matrix of dimension k by mn, in the context f(Q) = F vec(Q). For a WOPP, F is a diagonal matrix with k = mn.
F            The surface of f(Q).
T            Tangent plane of F at a given point.
N            Normal space of F at a given point.
N_p          Normal plane of F at a given point.
N            A component of the normal space of F at a given point.
T            A component of the tangent space of F at a given point.


Item         Description
S            Skew-symmetric matrix, S = −S^T.
s            Vector used to represent the nonzero elements/parameters in S, i.e., a parametrization S(s).
J            Jacobian of f(Q).
H            Matrix containing second-order information about f(Q), used to compute the Hessian of ||f(Q) − b||_2^2.
e_i          ith column of an identity matrix.
λ            Lagrange parameter.
UΣV^T        Singular value decomposition of an m by n matrix; U ∈ R^{m×m}, Σ ∈ R^{m×n} and V ∈ R^{n×n}.
Null(Z)      Null space of Z ∈ R^{m×n}.
Range(Z)     The range of Z ∈ R^{m×n}.
OPP          Orthogonal Procrustes problem.
WOPP         Weighted orthogonal Procrustes problem.
SVD          Singular value decomposition.
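As a quick sanity check on the vec and Kronecker notation above: the standard identity vec(AQX) = (X^T ⊗ A) vec(Q) is what allows the WOPP residual to be written in the form f(Q) = F vec(Q), and with A and X diagonal, X^T ⊗ A is itself diagonal, matching the remark that F is diagonal with k = mn. The snippet below verifies the identity numerically; it is a textbook linear-algebra fact, not code from the thesis.

```python
import numpy as np

# Verify vec(A Q X) = (X^T kron A) vec(Q), the identity behind
# writing the WOPP residual as f(Q) = F vec(Q).
rng = np.random.default_rng(0)
m, n = 5, 3
A = np.diag(rng.uniform(1.0, 2.0, m))             # diagonal A, as in the table
X = np.diag(rng.uniform(1.0, 2.0, n))             # diagonal X, as in the table
Q, _ = np.linalg.qr(rng.standard_normal((m, n)))  # Q with orthonormal columns

vec = lambda B: B.reshape(-1, order="F")          # column-stacking vec-operator
F = np.kron(X.T, A)                               # diagonal here, since A and X are
assert np.allclose(vec(A @ Q @ X), F @ vec(Q))
```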


