Parallel L-BFGS-B Algorithm on GPU
Abstract

Due to the rapid advance of the general-purpose graphics processing unit (GPU), it is an active research topic to study performance improvement of non-linear optimization with parallel implementation on the GPU, as attested by the much research on parallel implementation of relatively simple optimization methods, such as the conjugate gradient method. We study in this context the L-BFGS-B method, or the limited-memory Broyden-Fletcher-Goldfarb-Shanno method with boundaries, which is a sophisticated yet efficient optimization method widely used in computer graphics as well as general scientific computation. By analyzing and resolving the inherent dependencies of some of its search steps, we propose an efficient parallel implementation of L-BFGS-B on the GPU. We justify our design decisions and demonstrate significant speed-ups of our parallel implementation in solving the centroidal Voronoi tessellation (CVT) problem as well as some typical computing problems.

Keywords: Nonlinear optimization, L-BFGS-B, GPU, CVT
1. Introduction

Nonlinear energy minimization is at the core of many algorithms in graphics, engineering, and scientific computing. Due to their rapid convergence and moderate memory requirements for large-scale problems [1], the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm and its variant, the L-BFGS-B algorithm [2, 3, 4], are efficient alternatives to other frequently used energy minimization algorithms such as the conjugate gradient (CG) [5] and Levenberg-Marquardt (LM) [6] algorithms. Furthermore, L-BFGS-B is favored as the core of many state-of-the-art algorithms in graphics, such as the computation of centroidal Voronoi tessellation (CVT) [7], mean-shift image segmentation [8], medical image registration [9], face tracking for animation [10], and the composition of vector textures [11]. Among these applications, the computation of CVT is the basis of numerous applications in graphics, including flow visualization [12], image compression and segmentation [13, 14, 15], surface remeshing [16, 17, 18], object distribution [19], and stylized rendering [20, 21, 22]. Hence, a high-performance L-BFGS-B solver is desired by the graphics community for its wide applications.

L-BFGS-B is an iterative algorithm. After being initialized with a starting point and boundary constraints, it iterates through five phases: (1) gradient projection; (2) generalized Cauchy point calculation; (3) subspace minimization; (4) line searching; and (5) limited-memory Hessian approximation. Recently, there has been a trend towards the use of parallel hardware such as the GPU to accelerate energy minimization algorithms. Successful examples, including GPU-based CG [23, 24] and GPU-based LM [25], have demonstrated the clear advantages of parallelization. However, such parallelization is challenging for L-BFGS-B, since there are strong dependencies in some of its key steps, namely (2) generalized Cauchy point calculation, (3) subspace minimization, and (4) line searching.
In this paper, we tackle this problem and make the following contributions:

• We approximate the generalized Cauchy point with much less calculation while maintaining a similar rate of convergence. By doing so, we remove the dependency in the computation and make the algorithm suitable for parallel implementation on the GPU.

• We propose several new GPU-friendly expressions to compute the maximal possible step length for backtracking and line searching, making it computable with a parallel reduction.

• We demonstrate the speedup of L-BFGS-B enabled by our parallel implementation with extensive tests, and present example applications that solve typical nonlinear optimization problems in both graphics and scientific computing.

In the remainder of this paper, we first briefly review the BFGS family and optimization algorithms on the GPU in Section 2. Next, we review the L-BFGS-B algorithm in Section 3 and introduce our adaptation on the GPU in Section 4. Experimental results are given in Section 5, comparing our implementation with the latest L-BFGS-B implementation on the CPU [26] using two examples from different fields: the centroidal Voronoi tessellation (CVT) problem [7, 27] in graphics, as well as the Elastic-Plastic Torsion problem from the classical MINPACK-2 test problem set [28] in scientific computing for generality. Finally, Section 6 discusses the limitations of our GPU implementation and Section 7 concludes the paper with possible future work. Our prototype is open source and can be freely downloaded from Google Code (http://code.google.com/p/lbfgsb-on-gpu/).

Preprint submitted to Computers & Graphics, December 8, 2013


2. Related Work

We briefly review previous work on the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm and its extensions, as well as previous work on GPU-based nonlinear optimization.

2.1. BFGS Optimization

The BFGS algorithm [29] approximates the Newton method for solving nonlinear optimization problems. Since its memory requirement increases quadratically with the problem size, the BFGS algorithm is not suitable for large-scale problems. The seminal work by Liu and Nocedal [2] approximates the Hessian matrix with a reduced memory requirement that is linear in the number of input variables. Their method is called the L-BFGS algorithm, where "L" stands for limited memory. In addition, a bound-constrained version of the L-BFGS algorithm, namely the L-BFGS-B algorithm, was proposed by Byrd et al. [3], and its Fortran implementation was given by Zhu et al. [4]. Furthermore, there are some variants [30, 31, 32, 33] that propose improvements by combining the L-BFGS algorithm with other optimization methods. Recently, Morales et al. [34, 26] improved (currently in version 3.0) the subspace minimization step through a numerical study. We build our prototype based on their code, using the techniques detailed in the next section to parallelize it on the GPU.

2.2. Nonlinear Optimization on GPU

The work by Bolz et al. [24] for the first time mapped two computational kernels of nonlinear optimization onto the GPU, specifically a sparse-matrix conjugate gradient solver and a regular-grid multi-grid solver. Since then, how to map the conjugate gradient solver efficiently onto the GPU has been extensively studied. Hillesland et al. [35] described a framework using the conjugate gradient method for solving many large nonlinear optimizations concurrently on graphics hardware, which was applied to image-based modeling. Goodnight et al. [36] introduced a multi-grid solver for boundary value problems on the GPU. Krüger and Westermann [37] also proposed some basic linear algebra operations on the GPU and used them to construct a conjugate gradient solver and a Gauss-Seidel solver. Later, Feng and Li [38] implemented a multi-grid solver on the GPU for power grid analysis, and Buatois et al. [39] presented a sparse linear solver on the GPU. Recently, in the parallel computing community, the conjugate gradient method has been implemented on multi-GPU systems [23, 40, 41] and even multi-GPU clusters [42]. A complete survey of this topic is beyond the scope of this paper; please refer to Verschoor's paper [43] for more details.

Besides the conjugate gradient method, Li et al. [25] described a GPU-accelerated Levenberg-Marquardt optimization. There are also some previous attempts at parallelizing the L-BFGS method and its variants (e.g., the L-BFGS-B method) on the GPU. Yatawatta et al. [44] implemented GPU-accelerated Levenberg-Marquardt and L-BFGS optimization routines. They used a hybrid approach where only the evaluation of the target function and its gradients is implemented on the GPU, while the rest of the optimization work remains on the CPU. Some other works [27, 46] followed a similar style. None of them implemented the core parts of the optimization (line searching, subspace minimization, etc.) on the GPU. Wetzl et al. [45] introduced a straightforward implementation of the L-BFGS algorithm on the GPU where boundaries are ignored, which makes their implementation unsuitable for problems with constraints. As far as we know, the method presented in this paper is the first to run all the core parts of the L-BFGS-B optimization on the GPU, except for some high-level branching logic control.

3. Algorithm

The L-BFGS-B algorithm was introduced by Byrd et al. [3]. We follow the notation in their paper to briefly introduce the algorithm in this section.

The L-BFGS-B algorithm is an iterative algorithm that minimizes an objective function f(x), x ∈ R^n, subject to the boundary constraints l ≤ x ≤ u, where l, x, u ∈ R^n. In the k-th iteration, the objective function is approximated by a quadratic model at a point x_k:

    m_k(x) = f(x_k) + g_k^T (x − x_k) + (1/2) (x − x_k)^T B_k (x − x_k),    (1)

where g_k is the gradient at the point x_k and B_k is the limited-memory BFGS matrix that approximates the Hessian matrix at x_k. In each iteration, the most crucial phases are: (1) the computation of the generalized Cauchy point; and (2) the subspace minimization.

3.1. Generalized Cauchy Point

To simplify notation, following [3], we shall drop the index k of the outer iteration in the rest of this section. Thus, B, g, x, and m correspond to B_k, g_k, x_k, and m_k used above. Subscripts will be used to denote the components of a vector, and superscripts to denote iterations during the search for the generalized Cauchy point. To minimize m_k(x) in Eqn. 1, the generalized Cauchy point x^c = x(t*) is computed at the first local minimizer t* along the piecewise linear path x(t) = P(x^0 − t g; l, u), the projection of the steepest descent path onto the feasible box, described below.

Each coordinate x_i(t) of the piecewise linear path x(t) is defined as

    x_i(t) = x_i^0 − t g_i,    t ∈ [0, t_i],    (2)

where the breakpoint t_i in each dimension, which is the bound induced by the rectangular bounding region (l, u), is given by

    t_i = (x_i^0 − u_i) / g_i    if g_i < 0,
          (x_i^0 − l_i) / g_i    if g_i > 0,
          ∞                      otherwise.    (3)

The breakpoints {t_i : i = 1, ..., n} are sorted into an ordered set {t^j : t^(j−1) < t^j, j = 2, ..., n}. To find the minimizer
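The breakpoints in Eqn. 3 are computed independently per coordinate, which is exactly what makes this step parallel-friendly. A minimal NumPy sketch of Eqns. 2-3 follows; this is a CPU illustration only, and the function and variable names are ours rather than from the paper's code:

```python
import numpy as np

def breakpoints(x0, g, l, u):
    """Breakpoints t_i of the projected steepest-descent path (Eqn. 3).

    t_i = (x_i - u_i)/g_i if g_i < 0,
          (x_i - l_i)/g_i if g_i > 0,
          inf             otherwise.
    Each t_i is the step at which coordinate i hits its bound, and each
    coordinate can be evaluated by an independent GPU thread.
    """
    t = np.full_like(x0, np.inf)
    neg, pos = g < 0, g > 0
    t[neg] = (x0[neg] - u[neg]) / g[neg]
    t[pos] = (x0[pos] - l[pos]) / g[pos]
    return t

# Example: a point inside the box [-1, 1]^3
x0 = np.array([0.0, 0.5, -0.5])
g  = np.array([1.0, -2.0, 0.0])
l  = np.full(3, -1.0)
u  = np.full(3,  1.0)
t  = breakpoints(x0, g, l, u)
order = np.sort(t[np.isfinite(t)])  # ordered breakpoints t^1 <= t^2 <= ...
```

The subsequent sort into the ordered set {t^j} is a standard parallel primitive on the GPU.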


Relative error |t^c − t*| / t* × 100%, binned per problem:

    Bin         | Elastic-Plastic Torsion | Journal Bearing | Minimal Surfaces | Optimal Design | 1-D Ginzburg-Landau | Lennard-Jones Clusters | Steady-State Combustion | 2-D Ginzburg-Landau
    0           | 85.23%                  | 55.14%          | 100.00%          | 86.41%         | 100.00%             | 100.00%                | 100.00%                 | 98.70%
    0~5%        | 12.19%                  | 35.26%          | 0.00%            | 8.99%          | 0.00%               | 0.00%                  | 0.00%                   | 1.30%
    5%~10%      | 1.41%                   | 3.40%           | 0.00%            | 1.10%          | 0.00%               | 0.00%                  | 0.00%                   | 0.00%
    10%~15%     | 0.35%                   | 1.00%           | 0.00%            | 0.80%          | 0.00%               | 0.00%                  | 0.00%                   | 0.00%
    15%~20%     | 0.00%                   | 0.70%           | 0.00%            | 0.30%          | 0.00%               | 0.00%                  | 0.00%                   | 0.00%
    20%~25%     | 0.35%                   | 0.60%           | 0.00%            | 0.40%          | 0.00%               | 0.00%                  | 0.00%                   | 0.00%
    25%~30%     | 0.12%                   | 0.30%           | 0.00%            | 0.20%          | 0.00%               | 0.00%                  | 0.00%                   | 0.00%
    30%~35%     | 0.00%                   | 0.50%           | 0.00%            | 0.00%          | 0.00%               | 0.00%                  | 0.00%                   | 0.00%
    >=40%       | 0.35%                   | 3.10%           | 0.00%            | 1.80%          | 0.00%               | 0.00%                  | 0.00%                   | 0.00%
    Energy Diff | 0.00%                   | 0.00%           | 0.00%            | 0.00%          | 0.00%               | 0.00%                  | 0.00%                   | 0.00%

Table 1: Comparison of our approximated t^c and the real t* in the L-BFGS-B algorithm for all eight minimization problems in MINPACK-2. The size of all problems is 40,000, and the boundary is [−1, 1] in all dimensions. "Energy Diff" is the difference in final energy between using the original generalized Cauchy point and our approximated one.

Then a minimal parallel reduction [47] is performed:

    α* = min(1, min_i α_i).    (7)

The last step in each iteration of the L-BFGS-B algorithm uses a line search to find a new point for the next iteration, which can similarly be computed with a minimal parallel reduction.

4.3. Implementation Details

We implement our GPU-based L-BFGS-B algorithm using the NVIDIA CUDA language [48].
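Eqn. 7 collapses the per-coordinate step bounds α_i into a single scalar with one min-reduction. A hedged NumPy illustration of the idea follows; the per-coordinate formula below is the standard box-constraint bound, shown only for illustration (the paper's own GPU-friendly expressions are developed in its Section 4), and the names are ours:

```python
import numpy as np

def max_feasible_step(x, d, l, u):
    """Per-coordinate bounds alpha_i on the step along direction d that
    keep l <= x + alpha*d <= u, followed by the reduction of Eqn. 7:
    alpha* = min(1, min_i alpha_i).

    The alpha_i are independent, so on the GPU each is computed by its
    own thread and the final min is a single parallel reduction [47].
    """
    alpha = np.full_like(x, np.inf)
    pos, neg = d > 0, d < 0
    alpha[pos] = (u[pos] - x[pos]) / d[pos]  # distance to upper bound
    alpha[neg] = (l[neg] - x[neg]) / d[neg]  # distance to lower bound
    return min(1.0, alpha.min())

x = np.array([0.0, 0.9])
d = np.array([1.0, 1.0])
l, u = np.full(2, -1.0), np.full(2, 1.0)
alpha_star = max_feasible_step(x, d, l, u)  # second coordinate hits u first
```

When no coordinate constrains the step (all α_i infinite), the clamp to 1 in Eqn. 7 takes over.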
The first problem to be handled is how to perform the matrix and vector operations that pervade the algorithm, especially when calculating the infinity norm of the projected gradient, the inverse L-BFGS matrix, and its pre-conditioners.

For large-scale problems, the number of columns of the inverse L-BFGS matrix and its pre-conditioners can be much larger than the number of rows m, where m is the maximum dimension of the Hessian approximation (normally between 3 and 8). Such matrices are often called "panels" or "multi-vectors" in the literature [49, 50], and it is common to compute the "panel-panel" multiplication [49, 50] using dot products, which we implement with the latest parallel reduction technique [47]. Take the matrices in Fig. 2 as an example. To calculate each output value, each thread samples values from both the left and the right matrix, multiplies them (applying further calculations if necessary, such as scaling, adding, or dividing by an extra value), and stores the result in shared memory. Then a parallel reduction is performed across the shared memory, and the value held by the thread whose index is zero is stored in the resulting matrix.

We have also tested the NVIDIA CUBLAS Library [51] for this "panel-panel" multiplication, and it proved less efficient than our method (Fig. 3), since it is not optimized for matrices of such special dimensions. The dimensions of the two matrices in this test are 8 × M and M × 8, where M varies from 10 to over 1,000,000. The initial values of the elements in these two matrices are random numbers in [0, 1].
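The structure of the "panel-panel" product can be sketched compactly: every entry of the small 8 × 8 result is an independent length-M dot product, which is what maps each entry to one thread block plus one reduction on the GPU. A NumPy sketch, with the reduction written as a serial dot product for illustration (names are ours):

```python
import numpy as np

def panel_multiply(A, B):
    """'Panel-panel' product of an m x M panel A with an M x m panel B.

    Each output entry C[i, j] is an independent length-M dot product,
    mirroring the paper's scheme: per-thread partial products go to
    shared memory and a parallel reduction produces the entry.  Here
    np.dot stands in for that reduction.
    """
    m, k = A.shape[0], B.shape[1]
    C = np.zeros((m, k))
    for i in range(m):
        for j in range(k):
            C[i, j] = np.dot(A[i, :], B[:, j])  # the reduced value
    return C

rng = np.random.default_rng(0)
A = rng.random((8, 1000))   # wide panel
B = rng.random((1000, 8))   # tall panel
C = panel_multiply(A, B)
```

Because the result is only 8 × 8 while M is large, the work is dominated by the reductions, which is why a hand-written reduction can beat a general-purpose GEMM here.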
All the computations are in double precision, and the difference between the results of our method and of CUBLAS is less than 1E−9. Although cublasDgemm is faster than our implementation when M = 10 (when the matrices are almost square), our implementation outperforms it several times over as soon as M ≥ 20.

Figure 2: Matrix-matrix multiplication using parallel reduction.

Figure 3: Comparison of throughput for matrix multiplication using the CUBLAS Library (the cublasDgemm function call) and our method.

According to Volkov and Demmel [52], in the CUBLAS Library a matrix is generally divided into blocks whose results are accumulated by cycling 16× between the rows internally, resulting in lower parallelism; in addition, more synchronization is needed for updates across multiple blocks. This may contribute to the result that our implementation can be 16× faster once the number of variables grows beyond 5,000, as shown in Fig. 3. We have also experimented with the CUSPARSE Library and found that it performs similarly to CUBLAS. The reason is that all the matrices in L-BFGS-B are dense, so a sparse matrix solver can hardly demonstrate its power.

In the L-BFGS-B algorithm, the steepest descent direction is projected onto a feasible region defined by the boundary constraints. Variables whose values are at the lower or upper boundaries of this region are held fixed. The set of these variables is called the "active set" [3], indicating that the corresponding boundary constraints are active. Before the Hessian approximation, the status of the variables has to be traced to see whether they entered or left the active set. Instead of searching sequentially over all the variables, we first mark the variables that entered or left the active set in two arrays, and then perform a parallel scan [53] across each array to transform the boolean marks into sequential indices. Finally, we select each marked variable and put it in the position determined by its index (this operation is often called "compact" [53]). We use the implementation from the Thrust Library [54] for the parallel scan. This process is illustrated in Fig. 4.

Figure 4: Managing the status of variables using a parallel scan (E = variable entered the active set, L = variable left it, N = not changed; an exclusive scan over the marks yields the output positions for the compact operation).

The L-BFGS-B algorithm needs a Cholesky factorization to compute d̂u for the subspace minimization [26]. We use Henry's code [55] to compute the Cholesky factorization. Other linear algebra operations, such as solving triangular systems, are performed using the CUBLAS Library [51].
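The scan-and-compact step in Fig. 4 can be sketched in a few lines. Here is a hedged NumPy illustration of exclusive-scan compaction (on the GPU this would be Thrust's exclusive scan followed by a parallel scatter; the function and its names are ours):

```python
import numpy as np

def compact(flags):
    """Stream compaction via exclusive scan (the 'compact' of Fig. 4).

    flags marks the variables whose status changed; the exclusive prefix
    sum assigns each marked variable its slot in the output, so the final
    gather/scatter is embarrassingly parallel.  The scatter loop below is
    serial only for clarity.
    """
    flags = np.asarray(flags, dtype=np.int64)
    scan = np.cumsum(flags) - flags          # exclusive scan of the marks
    out = np.empty(int(flags.sum()), dtype=np.int64)
    for i, f in enumerate(flags):
        if f:
            out[scan[i]] = i                 # scatter index i to its slot
    return out

# The example string from Fig. 4: E = entered, L = left, N = not changed
status = list("NELNLLENNNNELLLN")
entered = compact([s == "E" for s in status])
left    = compact([s == "L" for s in status])
```

The two compacted index arrays are exactly the "variables from bounded to free" and "variables from free to bounded" lists needed before the Hessian approximation.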
5. Applications

We compare the efficacy and robustness of our GPU-based L-BFGS-B algorithm and the original CPU-based L-BFGS-B algorithm using the two applications described below. All experiments were performed with an Intel Xeon W5590 @ 3.33 GHz and an NVIDIA GTX 580 in double precision. The CUBLAS Library and the Thrust Library used are those included in CUDA Toolkit version 4.2.

5.1. GPU-based Centroidal Voronoi Tessellation

To explore the power of our GPU implementation in graphics, we tested our L-BFGS-B method on the centroidal Voronoi tessellation (CVT) problem. For the CVT, it was already shown in [27] how to evaluate the CVT energy function and compute its gradient on the GPU; however, the L-BFGS-B iterations are still performed on the CPU in [27]. In the following, we show how to perform the L-BFGS-B iterations on the GPU as well, and we report the resulting performance gain.

Centroidal Voronoi tessellation requires minimizing the following CVT function [7]:

    F(X) = ∑_{i=1}^{n} ∫_{Ω_i} ρ(x) ‖x − x_i‖^2 dσ,    (8)

where X = (x_1, x_2, ..., x_n) is an ordered set of n point sites, Ω_i is the Voronoi cell of site x_i, and ρ(x) is a density function at x.

We use the code from [27] for Voronoi tessellation on the GPU (VTGPU for short in the following) with both the CPU L-BFGS-B and our GPU L-BFGS-B implementations. We also transplant to the CPU implementation all the new "tweaks" we made to L-BFGS-B on the GPU, such as using an approximate generalized Cauchy point, maximizing the dimension of the Hessian approximation, and maximizing the step length in each iteration. This is fair treatment, because it has been observed that these "tweaks" make L-BFGS-B faster and simpler


(a) CPU L-BFGS-B: Q_min = 0.6342, Q_avg = 0.9164, θ_min = 36.2932, θ_avg = 53.2477.
(b) GPU L-BFGS-B: Q_min = 0.6383, Q_avg = 0.9169, θ_min = 36.6977, θ_avg = 53.2486.

Figure 5: Delaunay triangulations of 20,000 vertices (only a part of each is shown in the bottom figure), generated from CVT using VTGPU with either the CPU or the GPU L-BFGS-B iterations, together with angle histograms. θ is the smallest angle in a triangle; "min" is the minimal value over all triangles in the mesh; "avg" is the average value over all triangles in the mesh.


    Size   | Resolution | Evaluation | itr (CPU) | L-BFGS-B (I/O) (CPU) | itr (GPU) | L-BFGS-B (GPU) | spd/itr | Energy Difference
    60,000 | 2,048      | 44.97      | 28        | 112.07 (12.81)       | 28        | 4.06           | 27.60×  | 0
    40,000 | 2,048      | 43.96      | 34        | 74.18 (9.19)         | 34        | 3.18           | 23.33×  | 0
    20,000 | 2,048      | 42.43      | 45        | 30.09 (4.05)         | 55        | 2.19           | 13.74×  | -2.31E-08
    10,000 | 2,048      | 41.92      | 69        | 15.48 (2.27)         | 69        | 1.71           | 9.05×   | 0
    8,000  | 2,048      | 41.87      | 76        | 12.61 (1.92)         | 72        | 1.60           | 7.88×   | -3.38E-08
    6,000  | 2,048      | 41.84      | 89        | 11.2 (1.53)          | 71        | 1.53           | 7.32×   | 3.79E-07
    4,000  | 2,048      | 41.38      | 106       | 6.51 (1.18)          | 106       | 1.40           | 4.65×   | 0
    4,000  | 1,024      | 8.85       | 48        | 6.21 (1.12)          | 48        | 1.33           | 4.67×   | 0
    2,000  | 1,024      | 8.76       | 61        | 3.37 (0.74)          | 61        | 1.24           | 2.72×   | 0
    1,000  | 1,024      | 8.70       | 74        | 1.81 (0.51)          | 74        | 1.16           | 1.56×   | 0
    500    | 512        | 2.38       | 56        | 1.02 (0.38)          | 56        | 1.13           | 0.90×   | 0
    200    | 512        | 2.39       | 53        | 0.65 (0.38)          | 59        | 1.09           | 0.60×   | -4.79E-05

Table 2: Statistics of VTGPU with the original CPU L-BFGS-B implementation and our GPU L-BFGS-B implementation. All timings are per iteration and in milliseconds. "itr" is the number of iterations; "Evaluation" is the time spent on function and gradient evaluation using the VTGPU algorithm; "L-BFGS-B" is the time spent on each L-BFGS-B iteration, including the time spent on data exchange (denoted by "I/O"); "spd/itr" is the per-iteration speed-up of L-BFGS-B.
In both implementati<strong>on</strong>s,we stop the iterati<strong>on</strong> when the decrement of energy isless than 1E − 64. Our experimental results are summarized inTable 2. We record the final CVT energy, iterati<strong>on</strong>s required,and timing data for different parts of both methods, to be detailedin the following. A comparis<strong>on</strong> of the total costs per iterati<strong>on</strong>between [27] and ours is shown in Fig. 8.343344345however, the CPU versi<strong>on</strong> grows much faster — about 20×more than the <strong>GPU</strong> implementati<strong>on</strong>. Fig. 6 compares the twoimplementati<strong>on</strong>s, where the slope of CPU L-<str<strong>on</strong>g>BFGS</str<strong>on</strong>g>-B is 1.61E−3 and the slope of <strong>GPU</strong> L-<str<strong>on</strong>g>BFGS</str<strong>on</strong>g>-B is 8.03E − 5.3163173183193203213223233243253263273283293303313323333343353363373383393403413425.1.1. Jump Flooding <str<strong>on</strong>g>Algorithm</str<strong>on</strong>g>The VT<strong>GPU</strong> uses the jump flooding algorithm (JFA) [56,57, 58] to compute the Vor<strong>on</strong>oi diagram <strong>on</strong> the <strong>GPU</strong>. The timingof this stage is labeled as “Evaluati<strong>on</strong>” in Table 2. Becauseboth competitors use the same code for this stage, theiriterati<strong>on</strong>-wise timings are generally the same.5.1.2. Data ExchangeUsing VT<strong>GPU</strong> with the CPU L-<str<strong>on</strong>g>BFGS</str<strong>on</strong>g>-B also requires dataexchanging between the device and host memory because thecomputati<strong>on</strong> of CVT energy and its gradient is <strong>on</strong> the <strong>GPU</strong>. Wemark the time spent <strong>on</strong> this communicati<strong>on</strong> as “I/O” in Table 2.It increases linearly with both the number of sites and the numberof iterati<strong>on</strong>s. 
The data exchange costs may occupy morethan 20% of the L-<str<strong>on</strong>g>BFGS</str<strong>on</strong>g>-B time <strong>on</strong> average, which is totallysaved in the <strong>GPU</strong> implementati<strong>on</strong>.5.1.3. L-<str<strong>on</strong>g>BFGS</str<strong>on</strong>g>-B Optimizati<strong>on</strong>The time spent <strong>on</strong> the optimizati<strong>on</strong> is the main part thatmade the difference in the comparis<strong>on</strong>. We break down this partinto the different columns. The number of iterati<strong>on</strong>s is recordedin the “itr” column of Table 2. The <strong>GPU</strong> versi<strong>on</strong> generally requiressimilar iterati<strong>on</strong>s to the CPU versi<strong>on</strong>. Some excepti<strong>on</strong>sare due to the different definiti<strong>on</strong>s of double precisi<strong>on</strong> betweenthe CPU and the <strong>GPU</strong>.The time per iterati<strong>on</strong> of the L-<str<strong>on</strong>g>BFGS</str<strong>on</strong>g>-B algorithm is recordedin the “L-<str<strong>on</strong>g>BFGS</str<strong>on</strong>g>-B” column of Table 2, with the speed-up recordedin the “L-<str<strong>on</strong>g>BFGS</str<strong>on</strong>g>-B” of the “Speed-up” column. Time of the twoimplementati<strong>on</strong>s increases linearly with the number of sites,3463473483493503513523533543553563573583593603617Figure 6: Performance comparis<strong>on</strong> for L-<str<strong>on</strong>g>BFGS</str<strong>on</strong>g>-B iterati<strong>on</strong>s in CVT problem.We also compare our generalized Cauchy point approximati<strong>on</strong>with the CPU implementati<strong>on</strong> without the approximati<strong>on</strong>.The results are listed in Table 3. It is clear that our approximati<strong>on</strong>is satisfying for the CVT problem.5.1.4. 
C<strong>on</strong>vergenceWe compute the difference between final energies from bothmethods, and record the F <strong>GPU</strong> − F CPU in the “Energy Difference”column of Table 2. The result of the <strong>GPU</strong> versi<strong>on</strong> ismostly the same as the result of the CPU versi<strong>on</strong>. Occasi<strong>on</strong>allythe two implementati<strong>on</strong>s generate different final energies,which indicates that different local minimum points are reached.To visually dem<strong>on</strong>strate this statement, we compare in Fig. 5the two Delaunay triangulati<strong>on</strong>s, which are dual to the CVTs,with 20, 000 sites, as well as their quality measurements. Herethe quality of a triangle is measured by Q = 2 √ 3S/ph [59],


|t c −t ∗ |t∗ × 100%resoluti<strong>on</strong> 512 resoluti<strong>on</strong> 1,024 resoluti<strong>on</strong> 2,048200 500 1000 2000 4000 4000 6000 8000 10000 20000 40000 600000 100.00% 98.39% 100.00% 100.00% 98.41% 99.07% 99.14% 98.72% 98.61% 96.43% 97.37% 97.06%0∼5% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%5%∼10% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 1.28% 0.00% 1.79% 2.63% 0.00%10%∼15% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%15%∼20% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 1.39% 0.00% 0.00% 0.00%20%∼25% 0.00% 1.61% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%25%∼30% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%30%∼35% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.86% 0.00% 0.00% 0.00% 0.00% 0.00%≥40% 0.00% 0.00% 0.00% 0.00% 1.59% 0.93% 0.00% 0.00% 0.00% 1.79% 0.00% 2.94%Table 3: Comparis<strong>on</strong> of our approximated t c and real t ∗ in L-<str<strong>on</strong>g>BFGS</str<strong>on</strong>g>-B algorithm for VT<strong>GPU</strong> with different sites and different resoluti<strong>on</strong>s. The first row in the headshows resoluti<strong>on</strong>s and the sec<strong>on</strong>d row shows site numbers.362363364365366367368CVT Energy2.4Energy for CPU L−<str<strong>on</strong>g>BFGS</str<strong>on</strong>g>−BEnergy for <strong>GPU</strong> L−<str<strong>on</strong>g>BFGS</str<strong>on</strong>g>−B2.221.81.61.41.210 0 10 1 10 22.6 x 10−4 # iterati<strong>on</strong>sFigure 7: During the optimizati<strong>on</strong>, the CVT energy changes likewise by usingCPU and <strong>GPU</strong>. The figure is generated in the same setting as Fig. 5.where S is the area, p is the half-perimeter, and h is the lengthof the l<strong>on</strong>gest edge. It is clear that the two implementati<strong>on</strong>sgenerate final results of similar quality. 
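As a quick sanity check of this quality measure (an illustration of ours, not code from the paper), Q evaluates to 1 for an equilateral triangle and approaches 0 as a triangle degenerates:

```python
import math

def triangle_quality(a, b, c):
    """Q = 2*sqrt(3)*S / (p*h), where S is the area, p the half-perimeter,
    and h the longest edge; a, b, c are 2D vertices as (x, y) tuples."""
    sides = [math.dist(a, b), math.dist(b, c), math.dist(c, a)]
    p = sum(sides) / 2.0                 # half-perimeter
    h = max(sides)                       # longest edge
    # Shoelace formula for the area S
    S = abs((b[0]-a[0])*(c[1]-a[1]) - (c[0]-a[0])*(b[1]-a[1])) / 2.0
    return 2.0 * math.sqrt(3.0) * S / (p * h)

print(triangle_quality((0, 0), (1, 0), (0.5, math.sqrt(3) / 2)))  # ~1.0 (equilateral)
print(triangle_quality((0, 0), (1, 0), (0.5, 1e-3)))              # near 0 (degenerate)
```

Values close to 1 therefore indicate well-shaped triangles.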
Our GPU implementation also guarantees stable convergence, as shown in Fig. 7, where the energies from both methods finally decrease to the same level and differ only slightly during the optimization process.

Figure 8: Performance comparison (total time in each iteration) between [27] and ours.

5.2. Elastic-Plastic Torsion in MINPACK-2

For generality, we also evaluate our implementation by solving the Elastic-Plastic Torsion problem from the classical MINPACK-2 problem set [28], a representative optimization problem in scientific computing. In this problem, the elastic-plastic stress potential in an infinitely long cylinder under torsion is determined, which is equivalent to minimizing a complementary energy q over a square feasible region D [60]:

q(v) = ∫_D ( ½ ‖∇v(x)‖² − c v(x) ) dx.   (9)

We use the same Fortran code on the CPU for evaluating q and its gradient, but two versions of the L-BFGS-B iterations, namely our GPU implementation and the original CPU implementation [26].
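To make (9) concrete, the integral can be approximated on a uniform grid over the unit square using forward differences. The sketch below is our simplified stand-in (basic quadrature, grid and function names are ours), not the actual MINPACK-2 Fortran evaluation:

```python
import numpy as np

def torsion_energy(v, c=5.0):
    """Approximate q(v) = integral over D of (0.5*||grad v||^2 - c*v) dx,
    with v sampled on a uniform n-by-n grid over the unit square."""
    n = v.shape[0]
    h = 1.0 / (n - 1)                    # grid spacing
    dvx = np.diff(v, axis=0) / h         # forward differences in x
    dvy = np.diff(v, axis=1) / h         # forward differences in y
    grad_term = 0.5 * (np.sum(dvx**2) + np.sum(dvy**2)) * h * h
    load_term = c * np.sum(v) * h * h
    return grad_term - load_term

v0 = np.zeros((65, 65))
print(torsion_energy(v0))   # 0.0; any feasible v with lower energy improves on it
```

An optimizer such as L-BFGS-B would then minimize this discretized q over the grid values of v, subject to the problem's bound constraints.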
Comparisons are presented in Table 4. We solve the 2D problem in which the two dimensions are set equal, their product being the "Size" of the problem. We evaluate from four hundred to four million variables, using c = 5.0 (recommended by [60]) for the constant c in (9).

The number of iterations required by our implementation on the GPU ("itr" columns) is similar to that of the CPU implementation. Moreover, our implementation outperforms the CPU implementation in each iteration ("L-BFGS-B" columns) beginning at size = 6,400: while yielding nearly the same final energy value, our implementation requires much less time. Fig. 9 compares the time per iteration of the GPU and CPU implementations for different problem sizes. Treating each curve as approximately linear, the CPU curve increases roughly 29× faster than the GPU curve as the problem size grows; the slope of the CPU implementation is 8.75E−4, while the slope of the GPU implementation is 3.03E−5. The speed-up per iteration is recorded in the "spd/itr" column.

From the statistical data, we again observe the efficacy of our approximation to the generalized Cauchy point. The average step lengths for computing the generalized Cauchy point are listed in the "t*" and "t_c" columns. Clearly, the values for both methods are quite similar, yet our approximation brings a considerable performance gain. Furthermore, our new strategy may even reduce the number of iterations (cf. the "itr" columns) while reaching an equivalent final energy in some cases. Due to the different implementations of double precision on the CPU and the GPU, there may be some energy differences. Nevertheless, they are


no more than 5.88E−11, and sometimes the energy computed by our implementation is lower than that produced by the original implementation (cf. the negative entries in the "Energy Difference" column).

In both experiments, we use a stop criterion of 1E−64; that is, we run the experiment until the decrease of energy falls below 1E−64. Note that in this experiment, since the computation of the energy function value and its gradient is still on the CPU, we need to transfer data between the CPU and the GPU in each iteration. Even with this overhead, the GPU implementation is still faster than the CPU one.

                          CPU                           GPU
Size        Evaluation    t*     itr    L-BFGS-B       t_c    itr    L-BFGS-B (I/O)    spd/itr    Energy Difference
4,000,000   141.09        0.09   4197   3351.68        0.10   3971   137.16 (17.92)    24.44×     5.88E-11
2,250,000   79.62         0.09   2917   2200.41        0.10   3213   78.58 (9.97)      28.00×     -4.52E-12
1,440,000   51.29         0.07   2915   1473.86        0.09   2636   51.50 (6.45)      28.62×     8.78E-12
1,000,000   36.07         0.08   2310   782.62         0.08   1907   36.61 (4.30)      21.38×     7.86E-12
640,000     23.12         0.07   1685   427.75         0.08   1793   24.23 (2.70)      17.65×     3.98E-12
360,000     13.11         0.08   1272   220.40         0.09   1315   15.27 (1.57)      14.43×     2.44E-12
160,000     5.91          0.17   921    117.74         0.18   1035   8.17 (0.63)       14.41×     -1.85E-12
40,000      1.53          0.24   549    15.09          0.24   464    3.27 (0.20)       4.61×      -1.91E-13
10,000      0.39          0.26   241    3.59           0.26   237    1.92 (0.19)       1.87×      3.00E-14
6,400       0.25          0.26   193    2.29           0.26   199    1.60 (0.08)       1.43×      -6.90E-14
3,600       0.14          0.26   144    1.34           0.27   155    1.50 (0.08)       0.89×      1.20E-14
1,600       0.06          0.26   133    0.93           0.26   105    1.41 (0.05)       0.66×      -9.99E-16
400         0.02          0.26   52     0.20           0.25   49     1.17 (0.04)       0.17×      -9.99E-15

Table 4: Statistics for the Elastic-Plastic Torsion problem. All timing data are in milliseconds. "Evaluation" is the time spent on function and gradient evaluation; t* and t_c are the average step lengths for computing the generalized Cauchy point; "itr" is the number of iterations; "L-BFGS-B" is the time spent on each L-BFGS-B iteration, including the time spent on data exchange (denoted "I/O"); "spd/itr" is the speed-up of L-BFGS-B in each iteration.

Figure 9: Performance comparison for L-BFGS-B iterations in the Elastic-Plastic Torsion problem.

6. Limitations

Currently, the performance of our method is limited by the memory bandwidth between the global video memory and the on-chip memory (shared memory, registers, etc.). We have also tested our implementation on a Tesla C2050. Although the Tesla C2050 has a much higher peak double-precision performance (515 GFlops) than the GTX 580 (193 GFlops), its performance on running our GPU L-BFGS-B algorithm is lower. More specifically, the ratio of the performance of the two cards is exactly the ratio of their memory bandwidths (144 GB/s vs. 192.4 GB/s), indicating that memory bandwidth is the bottleneck. Besides, our method still needs to read a few scalars from the GPU to the CPU in some stages, to control the high-level branching logic of the L-BFGS-B algorithm. A solution to these problems is to divide the variables into segments, compute on each segment, and then combine the results. With this strategy, instead of cycling between stages, one can pack all the iterations into a single kernel in which global memory is accessed only at the beginning and the end of the algorithm. However, this solution requires that the evaluation of the function, as well as the computation of the gradient vector, be divisible, which is not the case for many optimization problems.

7. Conclusion and Future Work

In this paper, we presented the first parallel implementation of the L-BFGS-B algorithm on the GPU.
Our experiments show that our approach makes the L-BFGS-B algorithm GPU-friendly and easily parallelized, so the time spent on solving large-scale optimizations is radically reduced. Future work includes breaking the bottleneck of memory bandwidth and exploring the parallelism of L-BFGS-B on multiple GPUs, or even clusters, for problems of larger scale.

References

[1] ALGLIB Project. Unconstrained optimization: L-BFGS and CG. 2013. http://www.alglib.net/optimization/lbfgsandcg.php#header3.
[2] Liu DC, Nocedal J. On the limited memory BFGS method for large scale optimization. Math Program 1989;45(1):503–28.


[3] Byrd RH, Lu P, Nocedal J, Zhu C. A limited memory algorithm for bound constrained optimization. SIAM J Sci Comput 1995;16(5):1190–208.
[4] Zhu C, Byrd RH, Lu P, Nocedal J. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans Math Softw 1997;23(4):550–60.
[5] Hestenes MR, Stiefel E. Methods of conjugate gradients for solving linear systems. 1952.
[6] Marquardt DW. An algorithm for least-squares estimation of nonlinear parameters. SIAM J Soc Ind Appl Math 1963;11(2):431–41.
[7] Liu Y, Wang W, Lévy B, Sun F, Yan D, Lu L, et al. On centroidal Voronoi tessellation – energy smoothness and fast computation. ACM Trans Graph 2009;28(4):101.
[8] Yang C, Duraiswami R, DeMenthon D, Davis L. Mean-shift analysis using quasi-Newton methods. In: Proceedings of ICIP ’03; vol. 2. IEEE; 2003, p. II–447.
[9] Chen YW, Xu R, Tang SY, Morikawa S, Kurumi Y. Non-rigid MR-CT image registration for MR-guided liver cancer surgery. In: Proceedings of ICME ’07. IEEE; 2007, p. 1756–60.
[10] Hyneman W, Itokazu H, Williams L, Zhao X. Human face project. In: ACM SIGGRAPH ’05 Courses. ACM; 2005, p. 5.
[11] Wang L, Zhou K, Yu Y, Guo B. Vector solid textures. ACM Trans Graph 2010;29(4):86.
[12] Du Q, Wang X. Centroidal Voronoi tessellation based algorithms for vector fields visualization and segmentation. In: Proceedings of Vis ’04. IEEE; 2004, p. 43–50.
[13] Du Q, Faber V, Gunzburger M. Centroidal Voronoi tessellations: Applications and algorithms. SIAM Rev 1999;41(4):637–76.
[14] Du Q, Gunzburger M, Ju L, Wang X. Centroidal Voronoi tessellation algorithms for image compression, segmentation, and multichannel restoration. J Math Imaging Vis 2006;24(2):177–94.
[15] Wang J, Ju L, Wang X. An edge-weighted centroidal Voronoi tessellation model for image segmentation. IEEE Trans Image Process 2009;18(8):1844–58.
[16] Alliez P, De Verdire E, Devillers O, Isenburg M. Isotropic surface remeshing. In: Proceedings of SMI ’03. 2003, p. 49–58.
[17] Du Q, Wang D. Anisotropic centroidal Voronoi tessellations and their applications. SIAM J Sci Comput 2005;26(3):737–61.
[18] Lévy B, Liu Y. Lp centroidal Voronoi tessellation and its applications. ACM Trans Graph 2010;29(4):119.
[19] Hiller S, Hellwig H, Deussen O. Beyond stippling – methods for distributing objects on the plane. Comput Graph Forum 2003;22(3):515–22.
[20] Secord A. Weighted Voronoi stippling. In: Proceedings of NPAR ’02. ACM; 2002, p. 37–43.
[21] Battiato S, Di Blasi G, Farinella GM, Gallo G. Digital mosaic frameworks – an overview. In: Comput Graph Forum; vol. 26. Wiley Online Library; 2007, p. 794–812.
[22] Deussen O, Isenberg T. Halftoning and stippling. In: Image and Video-Based Artistic Stylisation. Springer; 2013, p. 45–61.
[23] Cevahir A, Nukada A, Matsuoka S. Fast conjugate gradients with multiple GPUs. In: Proceedings of ICCS ’09. 2009, p. 893–903.
[24] Bolz J, Farmer I, Grinspun E, Schröder P. Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Trans Graph 2003;22(3):917–24.
[25] Li B, Young AA, Cowan BR. GPU accelerated non-rigid registration for the evaluation of cardiac function. In: Proceedings of MICCAI ’08. 2008, p. 880–7.
[26] Morales JL, Nocedal J. Remark on “Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization”. ACM Trans Math Softw 2011;38(1):1–4.
[27] Rong G, Liu Y, Wang W, Yin X, Gu XD, Guo X. GPU-assisted computation of centroidal Voronoi tessellation. IEEE Trans Vis Comput Graph 2011;17(3):345–56.
[28] Averick BM, Carter RG, Moré JJ, Xue GL. The MINPACK-2 test problem collection. Tech. Rep. MCS-P153-0692; Argonne National Laboratory; 1992.
[29] Broyden C, Dennis Jr J, Moré J. On the local and superlinear convergence of quasi-Newton methods. IMA J Appl Math 1973;12(3):223–45.
[30] Jiang L, Byrd RH, Eskow E, Schnabel RB. A preconditioned L-BFGS algorithm with application to molecular energy minimization. Tech. Rep. CU-CS-982-04; Department of Computer Science, University of Colorado; 2004.
[31] Gao G, Reynolds A. An improved implementation of the LBFGS algorithm for automatic history matching. In: Proceedings of ATCE ’04. 2004, p. 1–18.
[32] Schraudolph N, Yu J, Günter S. A stochastic quasi-Newton method for online convex optimization. In: Proceedings of AISTATS ’07. 2007, p. 433–40.
[33] Liu Y. HLBFGS. 2010. http://research.microsoft.com/en-us/UM/people/yangliu/software/HLBFGS/.
[34] Morales J. A numerical study of limited memory BFGS methods. Appl Math Lett 2002;15(4):481–7.
[35] Hillesland KE, Molinov S, Grzeszczuk R. Nonlinear optimization framework for image-based modeling on programmable graphics hardware. In: ACM SIGGRAPH ’05 Courses. 2005.
[36] Goodnight N, Woolley C, Lewin G, Luebke D, Humphreys G. A multigrid solver for boundary value problems using programmable graphics hardware. In: Proceedings of HPG ’03. 2003, p. 102–11.
[37] Krüger J, Westermann R. Linear algebra operators for GPU implementation of numerical algorithms. ACM Trans Graph 2003;22(3):908–16.
[38] Feng Z, Li P. Multigrid on GPU: tackling power grid analysis on parallel SIMT platforms. In: Proceedings of ICCAD ’08. 2008, p. 647–54.
[39] Buatois L, Caumon G, Levy B. Concurrent number cruncher: a GPU implementation of a general sparse linear solver. Int J Parallel, Emergent Distrib Sys 2009;24(3):205–23.
[40] Ament M, Knittel G, Weiskopf D, Strasser W. A parallel preconditioned conjugate gradient solver for the Poisson problem on a multi-GPU platform. In: Proceedings of PDP ’10. 2010, p. 583–92.
[41] Dehnavi M, Fernandez M, Giannacopoulos D. Enhancing the performance of conjugate gradient solvers on graphic processing units. IEEE Trans Magn 2011;47(5):1162–5.
[42] Cevahir A, Nukada A, Matsuoka S. High performance conjugate gradient solver on multi-GPU clusters using hypergraph partitioning. Comput Sci Res Dev 2010;25(1):83–91.
[43] Verschoor M, Jalba A. Analysis and performance estimation of the conjugate gradient method on multiple GPUs. Parallel Comput 2012;38(10-11):552–75.
[44] Yatawatta S, Kazemi S, Zaroubi S. GPU accelerated nonlinear optimization in radio interferometric calibration. In: Proceedings of IPC ’12. 2012, p. 1–6.
[45] Wetzl J, Taubmann O, Haase S, Köhler T, Kraus M, Hornegger J. GPU accelerated time-of-flight super-resolution for image-guided surgery. In: Tolxdorff T, Deserno TM, editors. Bildverarbeitung für die Medizin. 2013, p. 21–6.
[46] Sellitto M. Accelerating an imaging spectroscopy algorithm for submerged marine environments using heterogeneous computing. Master’s thesis; Department of Electrical and Computer Engineering, Northeastern University; 2012.
[47] Harris M. Optimizing parallel reduction in CUDA. NVIDIA Corporation; 2007. http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf.
[48] CUDA C programming guide. NVIDIA Corporation; 2007. http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html.
[49] Gunnels J, Lin C, Morrow G, van de Geijn R. A flexible class of parallel matrix multiplication algorithms. In: Proceedings of IPPS/SPDP ’98. 1998, p. 110–6.
[50] Humphrey J, Price D, Spagnoli K, Paolini A, Kelmelis E. CULA: hybrid GPU accelerated linear algebra routines. In: Proceedings of SPIE ’10. 2010, p. 770502.
[51] CUBLAS Library. NVIDIA Corporation; 2008. http://docs.nvidia.com/cuda/cublas/index.html.
[52] Volkov V, Demmel JW. Benchmarking GPUs to tune dense linear algebra. In: Proceedings of SC ’08. 2008, p. 31:1–31:11.
[53] Sengupta S. Efficient primitives and algorithms for many-core architectures. Ph.D. thesis; University of California, Davis; 2010.
[54] Thrust. NVIDIA Corporation; 2009. http://docs.nvidia.com/cuda/thrust/index.html.
[55] Henry S. Parallelizing Cholesky’s decomposition algorithm. Tech. Rep.; INRIA Bordeaux; 2009.
[56] Rong G, Tan TS. Jump flooding in GPU with applications to Voronoi diagram and distance transform. In: Proceedings of I3D ’06. 2006, p. 109–16.


[57] Rong G, Tan TS. Variants of jump flooding algorithm for computing discrete Voronoi diagrams. In: Proceedings of ISVD ’07. 2007, p. 176–81.
[58] Yuan Z, Rong G, Guo X, Wang W. Generalized Voronoi diagram computation on GPU. In: Proceedings of ISVD ’11. 2011, p. 75–82.
[59] Frey P, Borouchaki H. Surface mesh evaluation. In: Proceedings of IMR ’97. 1997, p. 363–74.
[60] Dolan E, Moré J, Munson T. Benchmarking optimization software with COPS 3.0. Tech. Rep. ANL/MCS-TM-273; Argonne National Laboratory; 2004.
