
NEW METHODS FOR DYNAMIC PROGRAMMING OVER AN INFINITE TIME HORIZON

A Dissertation Submitted to the Department of Management Science and Engineering and the Committee on Graduate Studies of Stanford University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Michael Justin O'Sullivan
September 2002


I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Arthur F. Veinott, Jr. (Principal Co-Adviser)

I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Michael A. Saunders (Principal Co-Adviser)

I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

Benjamin Van Roy

Approved for the University Committee on Graduate Studies:


Abstract

Two unresolved issues regarding dynamic programming over an infinite time horizon are addressed within this dissertation. Previous research uses policy improvement to find a strong-present-value optimal policy in such systems, but the time complexity of policy improvement is not known. Here, a method is presented for substochastic systems that breaks the problem of finding a strong-present-value optimal policy into a number of smaller dynamic programming subproblems. Each of the subproblems may be solved using linear programming, giving the entire process a running time that is polynomial in the size of the original dynamic programming problem. Also, for unique transition systems (a specialization of substochastic systems that includes general deterministic systems), other solution methods may be applied and the time complexity becomes strongly polynomial.

For normalized systems, policy improvement is still the method of choice. However, policy improvement requires the solution of linear systems that may not always have full rank. In a finite-precision environment, the stability of solution methods for these linear systems is critical. One may simplify the computations associated with policy improvement by classifying the states and considering each class separately. Here, a method is presented that applies to any policy with substochastic classes. The method uses the state classification to break the linear system into many smaller linear systems that are either of full rank or rank-deficient by one. Each of the smaller linear systems is then solved in a numerically stable way.

During the development of the preceding numerically stable method, the need for a sparse rank-revealing LU factorization became apparent. No such factorization exists in the current literature. Here, a new sparse LU factorization is presented that uses a threshold form of complete pivoting. It is found to be rank-revealing for all but the most pathological matrices.


Acknowledgements

I would first like to thank my wife Síobhán and my two children Sophia and Croí. They continue to provide the love, support and inspiration necessary to my existence. I would also like to thank my parents, Alison and Mike, and my siblings, Susannah, John and Matthew. They have shaped me throughout my life and any success I have is a direct result of their love and support.

Professor Arthur F. Veinott, Jr. introduced me to the subject of dynamic programming and has guided me throughout my time at Stanford. I am always impressed by the extent of his knowledge. This dissertation is largely a direct result of his patience, insightful mind, and hard work. I would like to express my sincere appreciation for all his time and effort.

Professor Michael A. Saunders shakes my hand every time I see him. He has been a friend as well as an advisor during my time at Stanford. Any time I had a question about linear algebra, he would not only supply the answer but also give useful insight into the problem. I would like to thank Mike for his kindness, thoughtfulness, and hard work throughout my time with him.

I would like to thank Professor Benjamin Van Roy for his input during the writing of this dissertation. His suggestions and comments were always useful and never failed to improve this work.

I would like to express my thanks to the faculty and staff of the Management Science and Engineering department at Stanford for their help and advice over the years. I would also like to thank Julie Ward and Hewlett-Packard for an exceptional work experience.

To my friends from Stanford and the Bay Area—you know who you are—I appreciate every moment we spent together. I hope I will see you again soon. Special thanks to those from the ex-colonies, New Zealand, South Africa and Australia, for helping me maintain my accent and sense of humor. Also, to my friends from the many football games I played, thanks for helping me retain my sanity through exercise.

Finally I would like to thank James Deaker and Karen McNay. James has been a good friend throughout my time at Stanford and I have never quite been able to thank him properly. Thanks, mate. Karen adopted a poor, disorganized Kiwi boy and made me feel like part of the family. I hope Karen and her family realize how much this meant to me. Thanks to you all.


Contents

Abstract
Acknowledgements

1 Introduction
   1.1 Strong-Present-Value Optimality
   1.2 Substochastic Systems
   1.3 Normalized Systems
   1.4 Dissertation Outline

2 Polynomial-Time Dynamic Programming
   2.1 Introduction
   2.2 Preliminaries
      2.2.1 Formulation
      2.2.2 Optimality Concepts and Conditions
      2.2.3 Policy Improvement
      2.2.4 Maximum-Transient-Value Policies
      2.2.5 Maximum-Reward-Rate Policies
   2.3 Strong-Present-Value Optimality in Polynomial Time
   2.4 Efficient Implementation via State-Classification
      2.4.1 Recurrent Classes
      2.4.2 Communicating Classes
      2.4.3 Solving the Maximum-Reward-Rate Subproblem
      2.4.4 Solving the Maximum-Transient-Value Subproblem
   2.5 Strongly Polynomial-Time Methods for Unique Transition Systems
   2.6 Prior Decomposition Methods
   2.7 Contributions


List of Figures

2.1 Three Subproblems have Distinct Solutions
2.2 Circuit-Rooted Tree
4.1 Comparing the condition of the factors for MS and MCS
4.2 Comparing the sparsity of the factors for MS and MCS
4.3 Comparing time per factorization for MS and MCS
4.4 Comparing the condition of the factors
4.5 Comparing the sparsity of the factors
4.6 Comparing the time per factorization
4.7 Comparing the condition of the factors as FactorTol changes
4.8 Comparing the sparsity of the factors as FactorTol changes
4.9 Comparing the time per factorization as FactorTol changes
4.10 Time per factorization as a function of the nonzeros in A
4.11 Time per factorization as a function of the sparsity of the factors
A.1 Counterexample for Denardo's Method


Chapter 1

Introduction

Many diverse applications exist for infinite-time-horizon dynamic programming (DP), e.g., developing medical treatment programs, determining harvest limits, and optimizing stock portfolios. This dissertation develops results in infinite-horizon DP and numerical linear algebra with applications to DP. The infinite-horizon results consist of polynomial-time and strongly polynomial-time methods for finding strong-present-value optimal policies in substochastic systems. The numerical linear algebra results include a stable method for calculating the present-value Laurent expansion coefficients for policies with substochastic classes. This method benefits from the other numerical linear algebra result, a sparse rank-revealing LU factorization.

The rest of this chapter is organized as follows. Section 1.1 discusses the appropriateness of strong-present-value optimality as the optimality criterion of choice. Previous research on substochastic systems is introduced in §1.2 and the lack of time-complexity results discussed. Section 1.3 introduces the extension of the work on substochastic systems to normalized systems, and discusses implementation issues arising from a finite-precision environment. Finally, §1.4 summarizes the contributions of this dissertation in more detail.

1.1 Strong-Present-Value Optimality

Consider a finite-state-and-action Markov decision chain (referred to here as a DP system, or system for short). How does one solve the DP problem of selecting a policy to control this process? For finite time horizons, the total (expected) reward a policy earns over the lifetime of the process gives a measure with which to compare policies. However, an infinite time horizon may allow policies to earn infinite total reward. Previous research uses concepts such as maximum reward rate, present-value optimality, and strong-present-value optimality to differentiate between policies in the infinite-horizon case.

The concept of maximum reward rate compares policies by looking at the average reward per


period each policy earns over the lifetime of the process. However, this optimality concept ignores the initial (possibly transient) behavior of a policy, focusing on long-term rewards. On the other hand, present-value optimality assumes a (positive) interest rate and compares policies by discounting all future rewards to the present. This optimality concept focuses on short-term rewards generated by a policy, since the importance of any long-term behavior will be reduced by discounting. Under strong-present-value optimality, optimal policies are those that simultaneously maximize present value for all small (positive) interest rates. Such policies also earn maximum reward rate. Strong-present-value optimality considers short-, intermediate-, and long-term behavior when comparing policies.

1.2 Substochastic Systems

In a substochastic system, the transition matrix for each stationary policy is substochastic. Blackwell [Bla62] introduces the problem of finding a strong-present-value optimal policy, i.e., a policy that has maximum present value for all sufficiently small positive interest rates, and shows non-constructively that a stationary strong-present-value optimal policy exists. Miller and Veinott [MV69] give a constructive proof by developing a policy-improvement method for finding such a policy. This method exploits the fact that the present value of the expected rewards that a policy earns has a Laurent expansion in the interest rate.

Previous work builds towards strong-present-value optimality by developing the weaker concept of finding a stationary policy that is n-optimal, i.e., one that lexicographically maximizes the first n + 2 coefficients of the Laurent expansion of the present value. When n = −1, this is the classic concept of finding a stationary maximum-reward-rate policy. Howard [How60]¹ and Blackwell [Bla62] show how to solve that problem by policy improvement. Manne [Man60], de Ghellinck [dG60], Derman [Der62] and Denardo [Den70b] develop one linear programming method for doing this, and Balinski [Bal61], Denardo and Fox [DF68], and Hordijk and Kallenberg [HK79] give another. Blackwell [Bla62] introduces the stronger concept of finding a stationary 0-optimal policy. Veinott [Vei66] shows how to solve this problem by policy improvement and Denardo [Den70a] shows how to do this by linear programming.

Veinott [Vei66, Vei68, Vei69, Vei74] introduces the problem of finding a stationary n-optimal policy and shows its equivalence to finding a stationary n-present-value (resp., n-Cesàro-overtaking) optimal policy. A stationary policy is strong-present-value optimal if and only if it is S-optimal, where S is the number of states. The papers [Vei69, Vei74] refine the policy-improvement algorithm of [MV69] to find a stationary n-optimal policy and suggest that a good way to implement that algorithm is to find a stationary m-optimal policy for m = −1, . . . , n in that order.

Veinott [Vei69] and Avrachenkov and Altman [AA98] show how to find an n-optimal policy in specialized systems (transient and irreducible, respectively) using a sequence of smaller problems

¹See [Vei66, pp. 1291–1293] for a development that fills a lacuna in the proof in [How60].


that may be solved using linear programming.

Denardo [Den71, p. 491] sketches a method for finding an n-optimal policy using either policy improvement or a combination of linear programming and some intermediate algorithms to solve a sequence of simpler DP problems. However, his method is subject to counterexample. Indeed, Appendix A gives an example in which his method fails to find a 1-optimal policy.

No previous work successfully shows how to use linear programming alone to find either an n-optimal policy for n > 0 or a strong-present-value optimal policy. Since the time complexity of policy improvement is unknown, the time required to find a strong-present-value optimal policy is not currently known.

1.3 Normalized Systems

In a normalized system, the transition matrix of every stationary policy has spectral radius not exceeding one. As Rothblum [Rot75] notes, Blackwell's [Bla62] proof of existence of a stationary strong-present-value optimal policy for substochastic systems extends in a straightforward way to normalized systems. Rothblum [Rot75] extends the methods of Miller and Veinott [MV69] and Veinott [Vei69, Vei74] by giving an augmented Laurent expansion of the present value and generalizing their policy-improvement method to normalized systems.

The (more general) policy improvement requires the coefficients of the (augmented) Laurent expansion. These coefficients are the unique solution of a set of linear equations (identical to those from Veinott [Vei69] except for the number of arbitrary variables). Veinott [Vei69] shows how to efficiently solve these linear equations (for substochastic systems) by identifying recurrent classes, solving within each such (stochastic, irreducible) class by repeated application of a Gaussian elimination matrix, and using these solutions to solve the remainder of the system. Rothblum [Rot75] notes that Veinott's [Vei69] method may be extended to normalized systems, but this requires extra work to partition the system into communicating classes and identify which classes are recurrent. No previous work discusses the numerical properties of the linear equations (although Veinott [Vei69] uses the linear dependence of these equations within recurrent classes) or the numerical stability of solution methods.

1.4 Dissertation Outline

Chapter 2 presents a new method for finding an n-optimal (resp., strong-present-value optimal) policy within a substochastic system by solving a sequence of 3n+5 simpler, well-studied DP subproblems. Previous work shows that these subproblems may be solved using either policy improvement or linear programming. Since linear programming can be solved in time that is polynomial in the size of the problem, this method finds an n-optimal (resp., strong-present-value optimal) policy in


polynomial time when using linear programming.

For a special refinement of substochastic systems, referred to here as unique transition systems, this method finds an n-optimal (resp., strong-present-value optimal) policy in strongly polynomial time. This result is especially appealing because the refinement includes general deterministic systems.

Chapter 3 presents a new method similar to Veinott's [Vei69] for determining the coefficients of the Laurent expansion in a system with substochastic classes. This method reduces the problem to that of finding the coefficients within a number of irreducible, substochastic systems. By partitioning the state space into communicating classes and following the induced dependence partial ordering, the coefficients for the entire system are found by considering each class separately. Within each (irreducible, substochastic) class, the new method determines whether the class is recurrent or transient using a rank-revealing LU (RRLU) factorization, then computes the coefficients by solving the linear equations (for that class) in a numerically stable way.

The research contained in Chapter 4 arose during the development of the numerically stable computations from Chapter 3. In large, sparse DP problems, solving the linear equations (for each class) may require factorizing a large, sparse matrix with a possible singularity. Since many of these coefficient matrices are encountered within each iteration, a fast, sparse RRLU factorization is desirable. Chan [Cha84], Hwang, Lin and Yang [HLY92], and Hwang and Lin [HL97] give dense RRLU factorizations. The LUSOL package of Gill, Murray, Saunders, and Wright [GMSW87] contains a fast, sparse LU factorization using threshold partial pivoting. This is sometimes, but not always, rank-revealing. Chapter 4 shows that adding a threshold form of complete pivoting to LUSOL gives a sparse RRLU factorization. This is compared with several other factorizations using different measures.
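The pivoting issue can be illustrated with a dense toy analogue (a sketch of ours, not the sparse threshold scheme inside LUSOL that Chapter 4 develops): under complete pivoting, every pivot is the largest entry of the remaining submatrix, so the pivot magnitudes of a rank-deficient matrix collapse once the rank is exhausted.

```python
import numpy as np

def lu_complete_pivoting(A):
    """Dense Gaussian elimination with complete pivoting.

    Returns the sequence of pivot magnitudes; a sharp drop reveals the
    numerical rank.  A toy dense analogue only, not the LUSOL code.
    """
    U = A.astype(float).copy()
    n = U.shape[0]
    pivots = []
    for k in range(n):
        # Choose the largest remaining entry as pivot (complete pivoting).
        sub = np.abs(U[k:, k:])
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        U[[k, k + i], :] = U[[k + i, k], :]
        U[:, [k, k + j]] = U[:, [k + j, k]]
        pivots.append(abs(U[k, k]))
        if U[k, k] != 0.0:
            m = U[k + 1:, k] / U[k, k]
            U[k + 1:, k:] -= np.outer(m, U[k, k:])
    return np.array(pivots)

# Rank-2 3x3 example: the third row is the sum of the first two.
A = np.array([[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [3.0, 4.0, 1.0]])
piv = lu_complete_pivoting(A)
rank = int(np.sum(piv > 1e-10 * piv[0]))
print(piv, rank)   # the final pivot vanishes, so the numerical rank is 2
```

Threshold variants relax this pivot choice to preserve sparsity, which is exactly the trade-off between reliability and fill-in that Chapter 4 investigates.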


Chapter 2

Polynomial-Time Dynamic Programming

2.1 Introduction

This chapter presents a method that finds a strong-present-value optimal policy for a finite-state-and-action infinite-horizon Markov decision chain in polynomial time. Using the weaker notion of n-optimality, this new method starts with an arbitrary policy and sequentially improves the policy until a strong-present-value optimal policy is found. The problem of finding an n-optimal policy from an (n−1)-optimal policy is decomposed into three simpler subproblems, each of which may be solved in polynomial time using linear programming. If n is the number of states, the method yields a strong-present-value optimal policy in polynomial time.

Section 2.2 formulates the problem of finding a strong-present-value optimal policy. That section defines the system, introduces relevant optimality concepts and conditions, and discusses previous approaches for finding such policies. It also introduces the subproblems at the core of the new method. The method itself is introduced in §2.3, taking the form of a theorem that demonstrates how to move from an (n−1)-optimal policy to an n-optimal policy using three simple (dynamic programming) subproblems. The extension of this theorem to find a strong-present-value optimal policy in polynomial time is then given. Next, by classifying the states, §2.4 shows how to efficiently implement the new method. The running time of this method becomes strongly polynomial for unique transition systems, a refinement of the original system definition that includes standard deterministic finite-state-and-action infinite-horizon dynamic programs. Section 2.5 describes three different forms of these systems, and gives the new running time in each case. Finally, §2.6 compares earlier decomposition methods with the new method and §2.7 summarizes the contributions within this chapter.


2.2 Preliminaries

2.2.1 Formulation

Consider a substochastic system observed in periods 1, 2, . . . . In each period, the system is in one of a finite collection S of S states or is in the stopped state. In each state s ∈ S, a finite set A_s of actions is available. The set of state-action pairs {(s, a) | s ∈ S, a ∈ A_s} has cardinality A. If one observes in a period that the system is in state s ∈ S and chooses action a ∈ A_s, the system earns expected reward r(s, a) and next period moves to state t ∈ S with probability p(t|s, a) ≥ 0 and to the stopped state with probability 1 − Σ_{t∈S} p(t|s, a). Once the system enters the stopped state it remains there and earns no further reward.

Since Blackwell [Bla62] proves the existence of a stationary strong-present-value optimal policy, it suffices to restrict attention to stationary policies within this chapter. Throughout, a (stationary) policy is a function δ that assigns an action δ_s ∈ A_s to each state s ∈ S. Each policy δ induces an S-vector r_δ ≡ (r(s, δ_s)) of expected rewards and an S × S transition matrix P_δ ≡ (p(t|s, δ_s)). Let ∆ be the set of all policies.

The N-period transition matrix for a policy δ is the Nth power P_δ^N of P_δ. This leads to the definition of a stationary transition matrix P_δ* ≡ lim_{N→∞} (1/N) Σ_{i=1}^N P_δ^{i−1} for a policy δ. Now each policy δ partitions S into the transient states T_δ ≡ {t ∈ S | (P_δ*)_{st} = 0 for all s ∈ S} under δ and the recurrent states R_δ ≡ S \ T_δ under δ.¹ If T_δ = S (i.e., P_δ* = 0), then δ is said to be transient.

2.2.2 Optimality Concepts and Conditions

Strong-present-value optimality. Suppose that rewards carried from one period to the next earn interest at the rate 100ρ% (ρ > 0) and let β ≡ 1/(1+ρ) be the discount factor. The present value V_δ^ρ of a policy δ is the (expected) present value of the rewards that δ earns in each period discounted to the beginning of period 0, i.e., V_δ^ρ ≡ Σ_{N=1}^∞ β^N P_δ^{N−1} r_δ. A policy δ is present-value optimal if V_δ^ρ ≥ V_γ^ρ for all γ ∈ ∆. Finally, δ is strong-present-value optimal if it is present-value optimal for all sufficiently small ρ.

n-Optimality. It is computationally challenging to discern directly whether or not a policy is strong-present-value optimal. However, building a sequence of n-optimal policies is more efficient and attains strong-present-value optimality (when n = S).

A policy δ is n-present-value optimal if

    lim_{ρ↓0} ρ^{−n} (V_δ^ρ − V_γ^ρ) ≥ 0   for all γ ∈ ∆.   (2.1)

¹The classification of states as transient or recurrent under a given policy leads to the concept of recurrent classes defined in §2.4.
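Concretely, for a fixed policy the present value is the solution of a single linear system, since V_δ^ρ = β r_δ + β P_δ V_δ^ρ (equation (2.2) below). A minimal numpy sketch on an invented two-state substochastic policy:

```python
import numpy as np

# Invented two-state substochastic policy: row sums below one model a
# positive probability of moving to the stopped state each period.
P = np.array([[0.6, 0.3],
              [0.0, 0.9]])
r = np.array([1.0, 2.0])

def present_value(rho):
    """Present value of the fixed policy at interest rate rho > 0.

    V = beta*r + beta*P @ V  rearranges to  (I - beta*P) V = beta*r.
    """
    beta = 1.0 / (1.0 + rho)
    return np.linalg.solve(np.eye(len(r)) - beta * P, beta * r)

for rho in (1.0, 0.1, 0.001):
    print(rho, present_value(rho))
```

As ρ ↓ 0, the values here increase toward the finite total value (I − P)⁻¹ r because this toy policy is transient; for a policy with recurrent states the present value instead grows like v_δ^{−1}/ρ, which is what the Laurent expansion below makes precise.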


Evidently

    V_δ^ρ = β r_δ + β P_δ V_δ^ρ.   (2.2)

Also, V_δ^ρ has the Laurent expansion

    V_δ^ρ = Σ_{n=−1}^∞ ρ^n v_δ^n   (2.3)

in small ρ > 0 [MV69]. By substituting (2.3) into (2.2), multiplying by 1 + ρ and equating coefficients of like powers of ρ, one sees that V^{n+1} ≡ (v^{−1}, v^0, . . . , v^{n+1}) = (v_δ^{−1}, v_δ^0, . . . , v_δ^{n+1}) satisfies

    r_δ^j + Q_δ v^j = v^{j−1},   j = −1, 0, . . . , n+1,   (2.4)

where Q_δ ≡ P_δ − I, r_δ^0 = r_δ, r_δ^j = 0 for j ≠ 0, and v_δ^{−2} = 0 [MV69]. Conversely, if the matrix V^{n+1} ≡ (V^n, v^{n+1}) satisfies (2.4), then V^n = V_δ^n ≡ (v_δ^{−1}, v_δ^0, . . . , v_δ^n) [Vei69]. Thus, (2.4) uniquely determines V^n = V_δ^n, but not v^{n+1}.

Writing B ≽ C for two matrices of like size means that each row of B − C is lexicographically nonnegative. Combining (2.1) and (2.3), a policy δ is n-present-value optimal if and only if δ is n-optimal, i.e., V_δ^n ≽ V_γ^n for all γ ∈ ∆.

Denote the set of n-optimal policies by ∆^n and notice that the n-optimal sets are nested, i.e., ∆ ⊇ ∆^{−1} ⊇ ∆^0 ⊇ ∆^1 ⊇ · · · . Extending [MV69], [Vei69] shows that there exists an m with 1 ≤ m ≤ S such that ∆ ⊇ ∆^{−1} ⊇ · · · ⊇ ∆^m = ∆^{m+1} = · · · . Moreover, an S-optimal policy is strong-present-value optimal.

2.2.3 Policy Improvement

Veinott [Vei69, Vei74] refines the policy-improvement method given in [MV69] to find an n-optimal policy as follows. For γ, δ ∈ ∆, let

    g_{γδ}^j ≡ r_γ^j + Q_γ v_δ^j − v_δ^{j−1},   j = −1, 0, . . .   (2.5)

and G_{γδ}^n ≡ (g_{γδ}^{−1}, . . . , g_{γδ}^n) for n ≥ −1. If δ is n-optimal, then

    G_{γδ}^n ≼ 0   for all γ ∈ ∆.   (2.6)

Conversely, if

    G_{γδ}^{n+1} ≼ 0   for all γ ∈ ∆,   (2.7)

then δ is n-optimal. Therefore, (2.6) is necessary and (2.7) is sufficient for n-optimality of δ. We say that δ satisfies the n-optimality conditions if (2.7) holds. If G_{γδ}^n ≻ 0 for some γ ∈ ∆ and γ_s = δ_s wherever row s of G_{γδ}^n is zero, call γ an n-improvement of δ. In that event V_γ^n ≻ V_δ^n. If δ has no (n+1)-improvement, then it satisfies the n-optimality conditions.
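When δ is transient, Q_δ = P_δ − I is nonsingular, so the system (2.4) can be solved by a simple forward recursion in j. A numpy sketch on invented data, cross-checked against the exact present value through the expansion (2.3):

```python
import numpy as np

# Invented transient policy: P substochastic with spectral radius < 1,
# so Q = P - I is invertible and (2.4) yields each v^j from v^{j-1}.
P = np.array([[0.5, 0.3],
              [0.2, 0.4]])
r = np.array([1.0, 2.0])
Q = P - np.eye(2)

def laurent_coeffs(n):
    """Return [v^{-1}, v^0, ..., v^n] from r^j + Q v^j = v^{j-1}, v^{-2} = 0."""
    coeffs, v_prev = [], np.zeros(2)
    for j in range(-1, n + 1):
        rj = r if j == 0 else np.zeros(2)     # r^0 = r, r^j = 0 otherwise
        v = np.linalg.solve(Q, v_prev - rj)
        coeffs.append(v)
        v_prev = v
    return coeffs

coeffs = laurent_coeffs(3)

# Cross-check: the truncated expansion (2.3) should approximate the exact
# present value, which satisfies ((1+rho) I - P) V = r, for small rho.
rho = 0.05
V = np.linalg.solve((1.0 + rho) * np.eye(2) - P, r)
approx = sum(rho**j * v for j, v in zip(range(-1, 4), coeffs))
print(np.abs(V - approx).max())   # small truncation error, O(rho^4)
```

The recursion applies verbatim only to transient policies; when P_δ has recurrent classes, Q_δ is singular and (2.4) must be solved with the extra care that Chapters 3 and 4 address.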


CHAPTER 2. POLYNOMIAL-TIME METHODS 8Let E −2δ≡ ∆ <strong>an</strong>d Eδ n ≡ {γ ∈ ∆ : Gn γδ = 0} <strong>for</strong> n ≥ −1, so E δ n n−1= {γ ∈ Eδ : gγδ n = 0} <strong>for</strong> n ≥ −1.Since V n = Vδ n satisfies (2.4) <strong>for</strong> −1 ≤ j ≤ n, δ ∈ Eδ n . Also, <strong>for</strong> <strong>an</strong>y two policies γ <strong>an</strong>d δ, Lemma 9of [Vei69, p. 1648] implies that Vγn − Vδ n is a linear function of G n+1 . In particular,γ ∈ E n δimplies that V n−1γ= V n−1δ<strong>an</strong>d vγ n − vδ n = Pγ ∗ g n+1γδ= Pγ∗ (rn+1γ − vδn )γδ(2.8a)<strong>an</strong>d soV n−1γ= V n−1δimplies that γ ∈ E n−1δ<strong>an</strong>d 0 = v n−1γ− v n−1δ= P ∗ γ g n γδ. (2.8b)It is immediate from the first assertion in (2.8a) <strong>an</strong>d the definitions involved thatif δ is (n−1)-optimal, then E n δ⊆ ∆ n−1 ⊆ E n−1δ. (2.9)Policy improvement may be implemented inductively as follows [Vei69, Vei74]. Given a policyδ that is (n−1)-optimal, search E n−1δ<strong>for</strong> <strong>an</strong> (n+1)-improvement γ of δ (so γ is (n−1)-optimal),replace δ by γ, <strong>an</strong>d repeat this procedure until a policy is found with no (n+1)-improvement. Thelast policy satisfies the n-optimality conditions <strong>an</strong>d so is n-optimal.2.2.4 Maximum-Tr<strong>an</strong>sient-Value PoliciesIf δ is tr<strong>an</strong>sient (see §2.2.1), its value V δ = ∑ ∞i=1 P i−1δr δ is finite, <strong>an</strong>d δ has maximum tr<strong>an</strong>sient valueif V δ ≥ V γ <strong>for</strong> all tr<strong>an</strong>sient γ ∈ ∆. The maximum-tr<strong>an</strong>sient-value problem is to find a tr<strong>an</strong>sientpolicy with maximum value among all such policies. Note that there is no requirement that everypolicy be tr<strong>an</strong>sient. The paper [EV75] establishes the following facts about this problem. 
A policy δ has maximum transient value if and only if δ is transient and $V_\delta$ is the least fixed point of the optimal return operator R defined by $RV \equiv \max_{\gamma\in\Delta}(r_\gamma + P_\gamma V)$ for $V \in \mathbb{R}^S$. Such a policy exists if and only if there is a transient policy and R has an excessive point, i.e., a $V \in \mathbb{R}^S$ for which $V \ge RV$. It is possible to find a maximum-transient-value policy either by policy iteration or by solving a linear program with S rows and A columns similar to that of d'Epenoux [d'E60].

2.2.5 Maximum-Reward-Rate Policies

The reward rate, i.e., the expected reward per period, of δ is $v^{-1}_\delta = P^*_\delta r_\delta$. The maximum-reward-rate problem is to find a policy with maximum reward rate. This problem is well-studied and previous work shows that there is such a policy and it can be found by policy iteration [How60, Bla62] or by solving a linear program [Bal61, DF68, HK79] with 2S rows and 2A columns.
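The least-fixed-point characterization of the optimal return operator R can be illustrated numerically. A small sketch, using a made-up two-state substochastic system in which every policy is transient, so iterating R upward from zero converges to the least fixed point (the numbers below are illustrative only):

```python
# Toy substochastic system with 2 states and 2 actions per state.
# P[a][s] is the row of transition probabilities out of state s under
# action a (rows sum to less than 1: the rest stops); r[a][s] is the
# one-period reward.  All numbers are illustrative only.
P = [[[0.0, 0.5], [0.0, 0.5]],
     [[0.5, 0.0], [0.5, 0.0]]]
r = [[1.0, 2.0], [0.0, 1.0]]

def R(V):
    """Optimal return operator (RV)_s = max_a [ r(s,a) + sum_t p(t|s,a) V_t ]."""
    return [max(r[a][s] + sum(p * v for p, v in zip(P[a][s], V))
                for a in range(len(P)))
            for s in range(len(V))]

# Every policy here is transient, so iterating R upward from V = 0
# converges to the least fixed point -- the maximum transient value.
V = [0.0, 0.0]
for _ in range(200):
    V = R(V)
print(V)  # converges to the fixed point (3, 4)
```

Any $V \ge RV$ is an excessive point; the iteration above approaches the least fixed point from below at a geometric rate, since the row sums of every $P_\gamma$ are at most 1/2 in this toy example.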


2.3 Strong-Present-Value Optimality in Polynomial Time

Given an (n−1)-optimal policy, this section shows how to construct an n-optimal policy. It follows that the construction of such a policy takes polynomial time, so a strong-present-value optimal policy may be found in polynomial time.

To that end, let $(v^{-1}_*, v^0_*, v^1_*, \ldots)$ be the lexicographic maximum of $(v^{-1}_\delta, v^0_\delta, v^1_\delta, \ldots)$ over $\delta \in \Delta$, $V^n_* \equiv (v^{-1}_*, \ldots, v^n_*)$ and $E^n_* \equiv \{\gamma \in E^{n-1}_* : r^n_\gamma + Q_\gamma v^n_* - v^{n-1}_* = 0\}$ for $n \ge -1$, where $v^{-2}_* \equiv 0$ and $E^{-2}_* \equiv \Delta$.

For each policy ζ, observe that $E^n_\zeta$ has the product property, i.e., is a product $E^n_\zeta = \times_{s\in S} E^n_{\zeta s}$ of subsets $E^n_{\zeta s} \subseteq A_s$ of actions for each state $s \in S$. Let $\mathbf{E}^n_\zeta$ be a set of policies such that

(i) $\mathbf{E}^n_\zeta$ has the product property $\mathbf{E}^n_\zeta = \times_{s\in S} \mathbf{E}^n_{\zeta s}$ for some subsets $\emptyset \ne \mathbf{E}^n_{\zeta s} \subseteq E^n_{\zeta s}$ for all $s \in S$, and

(ii) for each $s \in S$, $\mathbf{E}^n_{\zeta s}$ contains each action that is used by some policy in $E^n_\zeta$ for which state s is recurrent under that policy.

Thus, $\emptyset \ne \mathbf{E}^n_\zeta \subseteq E^n_\zeta$. The simplest example of such a set $\mathbf{E}^n_\zeta$ is $E^n_\zeta$ itself. Section 2.4 gives other examples that lead to substantially more efficient algorithms for solving the maximum-reward-rate subproblem in (b) of Theorem 2.1 below. Note that $\zeta \notin \mathbf{E}^n_\zeta$ is possible.

Theorem 2.1 Sequential Decomposition into Subproblems. Suppose $n \ge -1$.

(a) (n−1)-Optimality Conditions. If δ is (n−1)-optimal, then $v^n = v^n_\zeta$ is the least solution of

$v^n = \Big(\max_{\gamma \in E^{n-1}_*} r^n_\gamma - v^{n-1}_* + P_\gamma v^n\Big) \vee v^n_\delta$ (2.10)

for some $\zeta \in E^{n-1}_*$ that is transient on the states s for which $v^n_{\zeta s} > v^n_{\delta s}$ and that agrees with δ otherwise.
Moreover, each such γ = ζ maximizes $v^n_\gamma$ over all such γ and satisfies the (n−1)-optimality conditions.²

(b) n-Optimality for Recurrent States. If ζ satisfies the (n−1)-optimality conditions, then

$\max_{\gamma \in \mathbf{E}^n_\zeta} v^n_\gamma = v^n_\zeta + \max_{\gamma \in \mathbf{E}^n_\zeta} P^*_\gamma (r^{n+1}_\gamma - v^n_\zeta)$. (2.11)

Moreover, the sets of maximizers on both sides of (2.11) are nonempty and coincide. Also, each such common maximizer γ = η is (n−1)-optimal and is n-optimal on each state that is recurrent under some n-optimal policy.

² It is not efficient to implement part (a) when n = −1. In that case, it is better to omit (a) and begin with a modification of part (b) that omits the hypothesis about ζ and replaces $v^{-1}_\zeta$ by 0 and $E^{-1}_\zeta$ by ∆. Then the modified version of (b) evidently holds. If also one sets $\mathbf{E}^{-1}_\zeta = \Delta$, then the policy γ = η found in (b) is −1-optimal, so part (c) is unnecessary. However, as §2.4 discusses, it is necessary to solve subproblem (c) for other (often better) choices of $\mathbf{E}^{-1}_\zeta$.


(c) n-Optimality. If η is (n−1)-optimal and is n-optimal on each state that is recurrent under some n-optimal policy, then $v^n = v^n_\theta$ is the least solution of

$v^n = \Big(\max_{\gamma \in E^{n-1}_*} r^n_\gamma - v^{n-1}_* + P_\gamma v^n\Big) \vee v^n_\eta$ (2.12)

for some $\theta \in E^{n-1}_*$ that is transient on the set of states s for which $v^n_{\theta s} > v^n_{\eta s}$ and that agrees with η otherwise. Moreover, each such θ is n-optimal.

Role of the Three Subproblems. It seems useful to explain briefly the role of each of the three subproblems in Theorem 2.1. Consider a policy δ that is (n−1)-optimal. Then it is necessary to find a γ in $\Delta^{n-1}$ that maximizes $v^n_\gamma$ over $\Delta^{n-1}$. Unfortunately, this is generally hard to do for two reasons. First, it is difficult to find $\Delta^{n-1}$. Second, even if one has $\Delta^{n-1}$, it may not have the product property, which significantly complicates the problem of optimizing over that set. However, it follows from (2.9) that $\Delta^{n-1}$ has inner and outer approximations $E^n_\delta$ and $E^{n-1}_\delta = E^{n-1}_*$, respectively, that are easy to compute, have the product property and can be used as follows.

First solve the maximum-transient-value subproblem in (a) to find a policy γ = ζ that maximizes $v^n_\gamma$ over the γ in $E^{n-1}_*$ that are transient on states where γ differs from δ. The policy ζ also satisfies the (n−1)-optimality conditions. Now solve the maximum-reward-rate subproblem in (b) to find a policy γ = η that maximizes $v^n_\gamma$ over $\mathbf{E}^n_\zeta$. Since ζ satisfies the (n−1)-optimality conditions ((n−1)-optimality is not enough), it turns out that η is n-optimal on the states where some n-optimal policy is recurrent (though it doesn't find those states).
Finally, solve the maximum-transient-value subproblem in (c) to find a policy γ = θ that maximizes $v^n_\gamma$ over the γ in $E^{n-1}_*$ that are transient on states where γ differs from η. The policy θ is n-optimal.

The following corollary is a direct result of the subproblem decomposition of Theorem 2.1.

Corollary 2.2 Polynomial Running Time. The problem of finding an n-optimal, and hence a strong-present-value optimal, policy is solvable in polynomial time.

Proof. One may find an n-optimal policy by sequentially solving subproblems (a), (b) (with $\mathbf{E}^n_\zeta = E^n_\zeta$), and (c) of Theorem 2.1 to find an m-optimal policy in the order m = −1, 0, ..., n. Each of these 3n+5 subproblems is solvable using a linear program whose size is polynomial in the size of the dynamic program. Khachian [Kha79] shows that a linear program is solvable in time that is polynomial in its size. Also, the size of the solution of a linear program is polynomial in the size of the linear program itself. The size of each linear program for solving subproblems (a), (b), and (c) depends only on the size of the dynamic program and the solutions of the previous linear programs. Therefore, the size of each linear program for solving subproblems (a), (b), and (c) is polynomial in the size of the dynamic program, so the time required to solve each subproblem is polynomial in the size of the dynamic program. Therefore, the time to find an n-optimal policy is polynomial in


the size of the dynamic program. Also, since an S-optimal policy is strong-present-value optimal, a strong-present-value optimal policy may be found in polynomial time. □

The rest of this section contains a proof of Theorem 2.1 and an example that demonstrates that the three subproblems produce distinct solutions.

Proof of Theorem 2.1. Suppose that α is n-optimal. Also, for parts (b) and (c), let $R \equiv R_\alpha$ (resp., $T \equiv T_\alpha$) be the set of states where α is recurrent (resp., transient). Remember the sets R and T partition S.

(a) Suppose δ is (n−1)-optimal. Define a new system by first restricting the policies to $E^{n-1}_*$, letting the one-period rewards be $r^n_\gamma - v^{n-1}_*$ for $\gamma \in E^{n-1}_*$, and appending a new action in every state $s \in S$ that stops and earns the one-period reward $v^n_{\delta s}$. The policy that stops in each state and period is transient in the new system. Then for all $\gamma \in \Delta$, $V^n_\alpha \succeq V^n_\gamma$ and, from (2.6), $G^n_{\gamma\alpha} \preceq 0$. Also $V^{n-1}_\delta = V^{n-1}_\alpha$, so for $\gamma \in E^{n-1}_* = E^{n-1}_\alpha$, $G^{n-1}_{\gamma\alpha} = 0$ and $g^n_{\gamma\alpha} \le 0$. The last fact and $v^n_\alpha \ge v^n_\delta$ imply that $v^n_\alpha$ is an excessive point of the operator on the right-hand side of (2.10).
Therefore, from [EV75], there is a transient policy κ in the new system whose value $V_\kappa$ therein is the least solution of (2.10) and majorizes the value of every other transient policy in that system.

Next construct a policy $\zeta \in E^{n-1}_*$ in the original system with ζ being (n−1)-optimal and $v^n_\zeta = V_\kappa$. To that end, let $E = \{s \in S : V_{\kappa s} = v^n_{\delta s}\}$, so $V_{\kappa G} \gg v^n_{\delta G}$ where $G \equiv S \setminus E$. Define ζ by $\zeta_E \equiv \delta_E$ and $\zeta_G \equiv \kappa_G$. Now $P_{\delta EG} = 0$, for otherwise $P_{\delta EG} > 0$ and so $V_{\kappa E} = v^n_{\delta E} = r^n_{\delta E} - v^{n-1}_{*E} + P_{\delta EE} v^n_{\delta E} + P_{\delta EG} v^n_{\delta G} < r^n_{\delta E} - v^{n-1}_{*E} + P_{\delta EE} V_{\kappa E} + P_{\delta EG} V_{\kappa G} \le V_{\kappa E}$, which is a contradiction. Therefore, by suitably permuting the states,

$P_\zeta = \begin{pmatrix} P_{\kappa GG} & P_{\kappa GE} \\ 0 & P_{\delta EE} \end{pmatrix}$ and $P^*_\zeta = \begin{pmatrix} P^*_{\kappa GG} & P^*_{\zeta GE} \\ 0 & P^*_{\delta EE} \end{pmatrix} = \begin{pmatrix} 0 & P^*_{\zeta GE} \\ 0 & P^*_{\delta EE} \end{pmatrix}$, (2.13)

the last since κ is transient, so ζ is transient on G.

Also from (2.10), the definition of κ and $\delta \in E^{n-1}_\delta = E^{n-1}_*$, it follows that $\zeta \in E^{n-1}_*$. These facts and (2.8a) imply that $V^{n-2}_\zeta = V^{n-2}_*$ and $v^{n-1}_\zeta - v^{n-1}_* = P^*_\zeta g^n_{\zeta\delta} = P^*_{\zeta SE} g^n_{\delta\delta E} = 0$, so ζ is (n−1)-optimal. Now since $P_{\zeta GG} = P_{\kappa GG}$ is transient, $V = V_\kappa$ is the unique solution of

$V = \begin{pmatrix} r^n_{\zeta G} - v^{n-1}_{*G} \\ v^n_{\delta E} \end{pmatrix} + \begin{pmatrix} P_{\zeta GG} & P_{\zeta GE} \\ 0 & 0 \end{pmatrix} V$. (2.14)

However, $V = v^n_\zeta$ also satisfies this equation. To see this for states in G, substitute $v^{n-1}_{\zeta G} = v^{n-1}_{*G}$ into $g^n_{\zeta\zeta G} = 0$, while for states in E, recall that $P_{\zeta EG} = P_{\delta EG} = 0$ and so $v^n_{\zeta E} = v^n_{\delta E}$.
Thus, since the above equation has a unique solution, $V_\kappa = v^n_\zeta$.

Since $\zeta \in E^{n-1}_* = E^{n-1}_\delta$, $V^{n-1}_\zeta = V^{n-1}_\delta$ by (2.8a), so $G^{n-1}_{\gamma\zeta} = G^{n-1}_{\gamma\delta} \preceq 0$ for all $\gamma \in \Delta$ by (2.6). Also, since $v^n_\zeta$ satisfies (2.10), $g^n_{\gamma\zeta} \le 0$ for all $\gamma \in E^{n-1}_* = E^{n-1}_\zeta$, i.e., for all γ with $G^{n-1}_{\gamma\zeta} = 0$. Hence $G^n_{\gamma\zeta} \preceq 0$ for all $\gamma \in \Delta$, i.e., ζ satisfies the (n−1)-optimality conditions.


(b) Suppose ζ satisfies the (n−1)-optimality conditions. Now since $\emptyset \ne \mathbf{E}^n_\zeta \subseteq E^n_\zeta$ and $\mathbf{E}^n_\zeta$ has the product property, it follows from (2.8a) that (2.11) holds and the sets of maximizers on both sides thereof coincide. Thus, from [How60] and [Bla62], there is a γ = η maximizing the right-hand, and so the left-hand, side of (2.11) because the former is a maximum-reward-rate problem with one-period rewards $r^{n+1}_\gamma - v^n_\zeta$ for $\gamma \in \mathbf{E}^n_\zeta$. Since $\eta \in \mathbf{E}^n_\zeta \subseteq E^n_\zeta$, $V^{n-1}_\eta = V^{n-1}_\zeta$ by (2.8a), so η is (n−1)-optimal. Now $G^{n-1}_{\alpha\zeta} = 0$, from $V^{n-1}_\alpha = V^{n-1}_\zeta$ (both (n−1)-optimal) and (2.8b), so $g^n_{\alpha\zeta} \le 0$ because ζ satisfies the (n−1)-optimality conditions. Also, (2.8b) gives $0 = v^{n-1}_\alpha - v^{n-1}_\zeta = P^*_\alpha g^n_{\alpha\zeta} = P^*_{\alpha SR} g^n_{\alpha\zeta R}$, so $g^n_{\alpha\zeta R} = 0$ (since $P^*_\alpha \ge 0$). Hence $G^n_{\alpha\zeta R} = 0$ and there is a $\lambda \in \mathbf{E}^n_\zeta$ with $\lambda_R = \alpha_R$ and λ recurrent on R because α is recurrent thereon. Therefore $v^n_\lambda \le v^n_\eta$ by (2.11). Also $P_{\alpha RT} = 0$, so $v^n_{\alpha R} = v^n_{\lambda R} \le v^n_{\eta R} \le v^n_{\alpha R}$, whence $v^n_{\eta R} = v^n_{\alpha R}$.

(c) Suppose η is (n−1)-optimal and is n-optimal on each state that is recurrent under some n-optimal policy. Then from (a) of the Theorem, there is a $\theta \in E^{n-1}_*$ satisfying the assertions of (c) of the Theorem except perhaps for $\theta \in \Delta^n$. To show the last fact, observe that since $v^n_\theta$ satisfies (2.12), $v^n_\alpha \ge v^n_\theta \ge v^n_\eta$. Moreover, by hypothesis $v^n_{\eta R} = v^n_{\alpha R}$ and so $v^n_{\theta R} = v^n_{\alpha R}$. Also since $v^n_\theta$ satisfies (2.12) and $\alpha \in \Delta^n \subseteq E^{n-1}_*$, $v^n_\theta \ge r^n_\alpha - v^{n-1}_* + P_\alpha v^n_\theta = v^n_\alpha + P_\alpha (v^n_\theta - v^n_\alpha)$, implying $v^n_\theta - v^n_\alpha \ge P_\alpha (v^n_\theta - v^n_\alpha)$. Iterating this inequality yields $v^n_\theta - v^n_\alpha \ge P^N_\alpha (v^n_\theta - v^n_\alpha)$ for N = 0, 1, ....
Taking the (C, 1) limit of the last inequality yields $v^n_\theta - v^n_\alpha \ge P^*_\alpha (v^n_\theta - v^n_\alpha) = P^*_{\alpha SR} (v^n_{\theta R} - v^n_{\alpha R}) = 0$, so $\theta \in \Delta^n$. □

The following example illustrates the above ideas for n-optimality and shows that the subproblems produce distinct solutions.

Example 2.1 Three Subproblems have Distinct Solutions. Consider the deterministic system in Figure 2.1 with 2m + 3 states labeled $\sigma, 0, 1, \ldots, m, 0', 1', \ldots, m'$ where $m \ge 0$.

[Figure 2.1: Three Subproblems have Distinct Solutions]

There are two actions a and b in states σ and 0, and a single action a in every other state. In state σ, actions a and b move the system to states 0 and 0′ respectively, and each earns zero one-period reward. Action a in states s and s′ moves the system to the respective states $s + 1$ and $(s + 1)'$ modulo $(m + 1)$ and


earns the respective one-period rewards $2u^m_s$ and $u^m_s$ for s = 0, ..., m, where $u^m_s \equiv (-1)^s \binom{m}{s}$ for s = 0, 1, .... Finally, action b in state 0 earns zero one-period reward and moves the system to the stopped state.

A policy is a pair of actions for states σ and 0. For example, ba is the policy that takes action b in state σ and action a in state 0. Let n = m − 1. Then all four policies are (n−1)-optimal. If one starts with the policy δ = ab, solving subproblems (a), (b) and (c) of Theorem 2.1 yields the respective policies ζ = bb, η = ba and θ = aa. Also, $v^n_\delta < v^n_\zeta < v^n_\eta < v^n_\theta$ and only θ is n-optimal.

2.4 Efficient Implementation via State Classification

The subproblems (a) and (c) are maximum-transient-value problems and the subproblem (b) is a maximum-reward-rate problem. One may solve both problems by policy iteration or linear programming. This section first describes two state-space partitions and then proceeds to implement both approaches efficiently by partitioning the state space.

2.4.1 Recurrent Classes

As Bather [Bat73, pp. 542–545] shows, it is useful to classify the states of a Markov decision chain as recurrent or transient in a manner that generalizes that for Markov chains. In particular, call a state recurrent or transient according as it is recurrent under some policy or transient under every policy.
The sets of recurrent and transient states partition the state space.

Bather also gives a recursive procedure for finding the set of recurrent states and a partition thereof that is useful for computations. Ross and Varadarajan [RV91, pp. 197–198] show that the members of this partition are precisely the maximal subsets of recurrent states that form a recurrent class under some randomized stationary Markov policy. For this reason, call each member of the partition a recurrent class.³ Bather [Bat73, pp. 542–545] and Ross and Varadarajan [RV91, pp. 197–198] give different methods of finding the recurrent classes in O(S² + SA) time.

Incidentally, there is a second characterization of the recurrent classes in terms of the (nonrandomized stationary Markov) policies considered herein. To describe it, form an (undirected) recurrent-state graph whose nodes are the recurrent states and whose arcs are the pairs (s, t) of states s, t that both belong to a common recurrent class of some policy. Then it is easy to see that the recurrent classes are precisely the node sets of the maximal connected subgraphs (or components) of the recurrent-state graph.

³ Ross and Varadarajan [RV91] use the term strongly communicating class for what we call a recurrent class.
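The second characterization suggests a simple way to recover the recurrent classes once the pairwise relation is known. A sketch with made-up data, assuming the recurrent states and the arcs of the recurrent-state graph have already been computed (e.g. by Bather's procedure):

```python
def recurrent_classes(states, arcs):
    """Connected components of the (undirected) recurrent-state graph.

    `states` are the recurrent states; `arcs` are pairs (s, t) of states
    that share a recurrent class under some policy.  Union-find yields
    the maximal connected subgraphs, i.e. the recurrent classes.
    """
    parent = {s: s for s in states}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    for s, t in arcs:
        parent[find(s)] = find(t)

    classes = {}
    for s in states:
        classes.setdefault(find(s), set()).add(s)
    return sorted(map(sorted, classes.values()))

# Example: recurrent states 1-5; arcs pair up (1,2), (2,3) and (4,5).
print(recurrent_classes([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)]))
# -> [[1, 2, 3], [4, 5]]
```

This only illustrates the component step; computing the arcs themselves is the substantive part of the O(S² + SA) methods cited above.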


2.4.2 Communicating Classes

While recurrent classes are useful for solving the maximum-reward-rate subproblem in (b), solving the maximum-transient-value subproblems in (a) and (c) of Theorem 2.1 requires a different (well-known) partition of the state space. It is easiest to explain this in terms of a (directed) graph called the state graph. The nodes of this graph are the states and the stopped state, and the arcs are the ordered pairs of states (s, t), s ∈ S, for which p(t|s, a) > 0 for some $a \in A_s$ and t ∈ S, or for which t is the stopped state and $\sum_{t\in S} p(t|s, a) < 1$. State s is accessible to state t if there is a chain from s to t in the state graph. A state is accessible to itself with a chain consisting of only that state. A communicating class (or strong component of the state graph) is a maximal subset of mutually accessible states. The communicating classes partition the state space. Tarjan's [Tar72] depth-first-search method can be used to find the communicating classes and the communicating-class graph, i.e., the strong-component graph of the state graph, in O(S + A) time. Call a stochastic system communicating if S forms a communicating class.

2.4.3 Solving the Maximum-Reward-Rate Subproblem

Classifying the states into recurrent classes permits the maximum-reward-rate subproblem in (b) of Theorem 2.1 to be divided into smaller subproblems of the same type. The key to doing this is a felicitous choice of the action sets $\mathbf{E}^n_{\zeta s}$. To that end, first form the restricted system by limiting the actions in each state s to those in $E^n_{\zeta s}$. Next use one of the above algorithms to find the recurrent classes for the restricted system.
Now for each state s in some recurrent class C of the restricted system, let $\mathbf{E}^n_{\zeta s}$ be the set of actions in $E^n_{\zeta s}$ that keep the system in C, i.e., $\mathbf{E}^n_{\zeta s} = \{a \in E^n_{\zeta s} : \sum_{c\in C} p(c|s, a) = 1\}$. For each state s that is transient in the restricted system, let $\mathbf{E}^n_{\zeta s} = \{\zeta_s\}$. This definition of $\mathbf{E}^n_\zeta$ assures that the required conditions (i) and (ii) hold. Finally, choosing action $a \in \mathbf{E}^n_{\zeta s}$ in state s results in the reward rate $r^{n+1}(s, a) - v^n_{\zeta s}$. The transition rates are the original ones.

Since each recurrent class C of the restricted system has access only to states in C with every policy in $\mathbf{E}^n_\zeta$, it suffices to solve the maximum-reward-rate subproblem separately on each such class C. Then one common maximizer γ = η of both sides of the subproblem (b) is formed by letting $\eta_s = \zeta_s$ for each state s that is transient in the restricted system and, for each recurrent class C of the restricted system, letting $\eta_C$ be a maximum-reward-rate policy for the restriction of the subproblem to C.

To sum up, the above development shows how to divide the maximum-reward-rate subproblem in (b) of Theorem 2.1 (with $\mathbf{E}^n_\zeta$ defined as above) into a collection of state-disjoint maximum-reward-rate problems. In each of these state-disjoint problems, the system is stochastic and the state space is a single recurrent class. It is known and easy to see that a stochastic system has a state space that forms a single recurrent class if and only if the system is communicating. Moreover, much more


efficient algorithms are available for communicating stochastic systems than for general ones.

Solving a Maximum-Reward-Rate Problem in a Communicating Stochastic System. Consider a communicating stochastic system. Call a policy unichain if its recurrent states form a single class; naturally, the remaining states are transient. It is known from Denardo [Den70b] that one maximum-reward-rate policy for this problem is unichain. A unichain maximum-reward-rate policy can be found by linear programming or policy iteration.

• Linear Programming. It is immediate from Denardo [Den70b] that the basic solutions of Manne's [Man60] and de Ghellinck's [dG60] linear program are precisely the stationary distributions of unichain policies. Thus, application of the simplex method, for example, to that linear program iterates within the class of unichain policies until one with maximum reward rate is found. Manne's and de Ghellinck's linear program has only S + 1 rows and A columns, and so is much smaller than the linear program required for general systems, which has, as discussed above, 2S rows and 2A columns.

• Policy Iteration. Haviv and Puterman [HP91] show how to carry out policy iteration within the class of unichain policies.
This is much more efficient than using general policy iteration for communicating stochastic systems.

2.4.4 Solving the Maximum-Transient-Value Subproblem

One can use the communicating-class graph to solve a maximum-transient-value problem as follows. Let $v^* = (v^*_s)$ be the desired maximum transient value. Choose a communicating class C whose maximum transient values have not been found and that is accessible in one step only to classes for which those values have been found. (Note that the transient value for the stopped state is zero.) Denote by $T_C$ the union of the classes to which C has access in one step. Let the one-period reward associated with each state s ∈ C and each action a in that state be the sum of its one-period reward and its expected future reward in other classes, i.e., $r(s, a) + \sum_{t\in T_C} p(t|s, a)v^*_t$. Solve the maximum-transient-value problem for that class. The maximum transient values for the class are the desired values $v^*_s$ for each s ∈ C. Repeat this procedure until the maximum transient values for all classes have been found.

2.5 Strongly Polynomial-Time Methods for Unique Transition Systems

In this section we show how to find a strong-present-value optimal policy in strongly polynomial time in a unique transition system. A unique transition system is one in which each state s ∈ S and action $a \in A_s$ determines a unique t(s, a) such that either


• t(s, a) ∈ S, p(t(s, a)|s, a) > 0 and p(t|s, a) = 0 for t ∈ S \ {t(s, a)}, or

• t(s, a) is the stopped state and p(t|s, a) = 0 for each t ∈ S.

Such systems include standard deterministic dynamic programs. Remember A is the number of state-action pairs (s, a) with s ∈ S and $a \in A_s$.

A unique transition system induces a state-action graph with gains on the arcs. The nodes are the states and the stopped state. Each arc corresponds to a state-action pair (s, a) with s ∈ S and $a \in A_s$; the tail and head of the arc are s and t(s, a) respectively, and the arc's gain p(s, a) is p(t(s, a)|s, a) if t(s, a) ∈ S and 0 < p(s, a) ≤ 1 if t(s, a) is the stopped state. Note that this graph may have multiple arcs from one node to another.

A policy induces a subgraph of the state-action graph with each state being the tail of a single arc. Examination of the strong-component graph of the subgraph reveals that the latter is a set of node-disjoint circuit-rooted trees that spans the nodes of the state-action graph. A circuit-rooted tree is a subgraph of the state-action graph whose strong-component graph is a tree with all arcs directed towards a distinguished root node. The root node is a strong component consisting of either a simple circuit or the stopped state, and every other strong component consists of a single node. Figure 2.2 illustrates a circuit-rooted tree with the solid nodes forming the simple circuit.

[Figure 2.2: Circuit-Rooted Tree]

It is possible to classify the states under a policy with the aid of its induced subgraph.
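As a concrete sketch of this classification (using the criterion defined next: a state is recurrent exactly when it lies on a circuit of the policy's subgraph whose arc gains are all one), one can simply walk the successor function. The encoding below, with `succ[s]` the head of the policy's arc out of s and `gain[s]` its gain, is illustrative only:

```python
def classify_states(succ, gain):
    """Recurrent states of a policy in a unique transition system.

    succ[s] -- head of the policy's arc out of s (None = stopped state);
    gain[s] -- the arc's gain p(s, delta_s).  A state is recurrent iff
    it lies on a circuit all of whose gains are one; every other state
    is transient.  (Names and encoding are assumptions of this sketch.)
    """
    recurrent = set()
    for start in succ:
        # Walk forward until we reach the stopped state or revisit a state.
        seen = {}
        s, step = start, 0
        while s is not None and s not in seen:
            seen[s] = step
            s, step = succ[s], step + 1
        if s is not None:  # the walk closed a circuit through s
            circuit = [t for t in seen if seen[t] >= seen[s]]
            if all(gain[t] == 1 for t in circuit):
                recurrent.update(circuit)
    return recurrent

# Circuit 0 -> 1 -> 0 has unit gains, so it is recurrent; state 2 feeds
# into it and state 3 stops with gain 1/2, so both are transient.
succ = {0: 1, 1: 0, 2: 0, 3: None}
gain = {0: 1, 1: 1, 2: 1, 3: 0.5}
print(classify_states(succ, gain))  # -> {0, 1}
```

The naive walk is O(S²) in the worst case; computing strong components (e.g. with Tarjan's method, as in §2.4.2) does the same classification in linear time.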
Call a simple circuit in the subgraph recurrent if the gains on all arcs in the circuit are one; otherwise call the simple circuit transient. A state in the subgraph is recurrent if it lies in a recurrent simple circuit; otherwise the state is transient. A circuit-rooted tree in the subgraph is transient if its simple circuit, if any, is transient.

Corollary 2.3 Unique Transition Systems. In a unique transition system, the problem of finding an (n−2)-optimal policy, −1 ≤ n−2 ≤ S, is solvable in⁴

(a) Undiscounted O(nS[A ∧ S²] + A) time if p(s, a) = 1 for each s ∈ S and $a \in A_s$;

⁴ The assumption that n−2 ≤ S is without loss of generality since if n−2 > S, a policy is (n−2)-optimal if and only if it is S-optimal, or equivalently, strong-present-value optimal.


(b) Discounted O(nS²[A ∧ S²] + A) time if p(s, a) = β < 1 for each s ∈ S and $a \in A_s$; and

(c) General O(nS²A log S) time.

Proof. For both (a) and (b), consider a pair of states (s, t) where t is here possibly the stopped state and t is accessible from s in one step with some action $a \in A_s$. Then, of course, one can choose an action $a \in A_s$ that maximizes r(s, a) subject to t(s, a) = t and eliminate the other actions $\alpha \in A_s$ for which t(s, α) = t. This step requires O(A) time and assures that at most O(A ∧ S²) state-action pairs must be examined, viz., one for each pair (s, t) of states.

(a) Undiscounted. Suppose the hypotheses of part (a) of Corollary 2.3 hold. Then all circuits in the corresponding collection of circuit-rooted trees are recurrent. Thus each transient policy corresponds to a circuit-rooted tree in which the root component is the stopped state. Then the maximum-transient-value problem may be solved using the Bellman–Ford method [For56, Bel58] for finding a maximum-value rooted tree with root the stopped state. The time to find this tree is O(S[A ∧ S²]). An efficient way to do this is to use state-partitioning as discussed in §2.4.

The maximum-reward-rate problem may be solved by implementing Bather's state-partitioning method as follows. For each class, use Karp's [Kar78] algorithm to find a maximum-reward-rate circuit and then use breadth-first search backwards from the circuit to create a circuit-rooted tree and hence a maximum-reward-rate policy on the class.
Finally, solve the system maximum-reward-rate problem discussed in §2.4 by the Bellman–Ford method. The time to solve both of these problems is O(S[A ∧ S²]).

Thus the time to solve each subproblem (a), (b) or (c) of Theorem 2.1 is O(S[A ∧ S²]). Since 3n such subproblems must be solved, part (a) of the Corollary follows immediately.

(b) Discounted. Suppose the hypotheses of part (b) of the Corollary hold. Then the maximum-transient-value problem may be solved with the aid of state-partitioning. The method entails solving a maximum-transient-value problem on each class. Thus, it suffices to consider the case in which the system consists of a single class. On such a system and for each σ ∈ S, find the maximum transient value $C_\sigma$ over all circuits that start in σ. Then solve an augmented maximum-transient-value problem where in each state σ ∈ S one also allows stopping and earning the terminal reward $C_\sigma$.

One may find a maximum-transient-value circuit that starts in state σ ∈ S as follows. Calculate the maximum value $V^N_s$ among all N-step chains from s ∈ S to σ as follows:

$V^{N+1}_s = \max_{t(s,a)\in S}\big[r(s, a) + \beta V^N_{t(s,a)}\big], \quad N \ge 1$ (2.15)

where $V^1_s = \max_{t(s,a)=\sigma} r(s, a)$, the maximum being −∞ if the set over which a maximum is taken is empty. Then,

$C_\sigma = \max_{1\le N\le S} \frac{V^N_\sigma}{1 - \beta^N} = \max_{1\le N\le S}\big[V^N_\sigma + \beta^N C_\sigma\big]$. (2.16)
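The circuit-value computation (2.15)–(2.16) can be sketched directly. A toy implementation, assuming an illustrative encoding in which `actions[s]` lists the (reward, next-state) pairs out of s and every arc gain equals β (names and data below are made up for the sketch):

```python
NEG_INF = float("-inf")

def circuit_value(states, actions, beta, sigma):
    """C_sigma of (2.16): maximum discounted value over circuits that
    start in sigma, via the chain values V^N_s of (2.15).

    `actions[s]` is a list of (reward, next_state) pairs -- a toy
    encoding of a unique transition system whose arc gains all equal
    beta.  Names and encoding are assumptions of this sketch.
    """
    S = len(states)
    # V^1_s: best one-step reward from s straight into sigma.
    V = {s: max((r for r, t in actions[s] if t == sigma), default=NEG_INF)
         for s in states}
    best = V[sigma] / (1 - beta) if V[sigma] > NEG_INF else NEG_INF
    for N in range(2, S + 1):
        # (2.15): V^{N+1}_s = max_a [ r(s,a) + beta * V^N_{t(s,a)} ]
        V = {s: max((r + beta * V[t] for r, t in actions[s]),
                    default=NEG_INF)
             for s in states}
        if V[sigma] > NEG_INF:
            # (2.16): a length-N circuit repeated forever earns
            # V^N_sigma / (1 - beta^N)
            best = max(best, V[sigma] / (1 - beta ** N))
    return best

# Two states, circuit 0 -> 1 -> 0 with rewards 2 and 1, beta = 1/2:
# one trip round earns 2 + 0.5*1 = 2.5, so C_0 = 2.5/(1 - 0.25) = 10/3.
print(circuit_value([0, 1], {0: [(2, 1)], 1: [(1, 0)]}, 0.5, 0))
```

Only circuits of length at most S need be considered, which is why the loop stops at N = S, matching the range of the maximum in (2.16).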


Now let $U^N_s$ be the maximum value among all chains of length N or less that begin at s ∈ S and earn the terminal reward $C_\sigma$ if the chain ends in σ. Then for each s ∈ S,

$U^{N+1}_s = \max_{a\in A_s}\big[r(s, a) + \beta U^N_{t(s,a)}\big], \quad N \ge 0$ (2.17)

where $U^N_t \equiv 0$ if t is the stopped state and $U^0_s = C_s$. Then $U_s \equiv U^S_s$ is the maximum transient value among all policies and satisfies

$U_s = \max_{a\in A_s}\big[r(s, a) + \beta U_{t(s,a)}\big]$.⁵

Furthermore, any policy δ for which $\delta_s$ achieves the maximum on the right-hand side of the above equation for each s ∈ S has maximum transient value.

The time to find δ is O(S²[A ∧ S²]). Thus the time to solve subproblem (a) of Theorem 2.1 is O(S²[A ∧ S²]). Since the system is transient, any policy that solves subproblem (a) also solves subproblems (b) and (c), so the latter two are not needed. Since n such subproblems must be solved, part (b) of Corollary 2.3 follows immediately.

(c) General. Consider the general case of part (c) of Corollary 2.3. The maximum-transient-value problem can be solved in O(S²A log S) time as follows. Use the method of Hochbaum and Naor [HN94, pp. 1181–1184, 1186] to find the least excessive point of the optimal return operator in O(S²A log S) time.⁶ Then from [EV75], the least excessive point is the maximum transient value $v^*$; there is a transient policy δ such that $v^* = r_\delta + P_\delta v^*$; and each such policy has value $v^*$.

As shown in [EV75], one such transient policy δ = ($\delta_s$) can be found in O(SA) time as follows. Form a revised state-action graph in two steps.
First delete arcs (s, a) in the state-action graph for which v*_s ≠ r(s,a) + p(s,a) v*_{t(s,a)}. Next, for each state s for which there is an arc (s, a) in the remaining graph for which p(s,a) < 1, append an arc (s,a)′ from s to the stopped state. Then fan out backwards from the stopped state in the revised state-action graph to form a maximal subtree with all arcs directed towards the stopped state. The subtree spans the states and the stopped state because there is a maximum-value transient policy δ such that v* = r_δ + P_δ v*. Then for each state s, let δ_s be the action associated with the unique arc in the subtree leading out of s. The policy δ is transient since each state is accessible to the stopped state with positive probability under δ.

The maximum-reward-rate problem can also be solved in O(S²A log S) time as follows:

1. remove all arcs (s, a) with p(s,a) < 1 to form an undiscounted system;

2. solve the maximum-reward-rate problem on the undiscounted system as outlined in the proof for part (a) above;

[5] It is possible to cut the running time to find U = (U_s) using a refinement of Dijkstra's algorithm. However, this improvement does not reduce the order of the leading term of the overall running time.

[6] Their method repeatedly calls the Grapevine Algorithm 1 of Aspvall and Shiloach [AS80, pp. 834–835] to check whether a given number V_s is or is not as large as the maximum transient value in state s in O(SA) time.


3. restore all arcs, augment the system by allowing stopping in each state with the maximum reward rate from that state in the undiscounted system, and solve the maximum-transient-value problem using the method given above.

Thus, the time to find a maximum-reward-rate policy is O(S²A log S).

Hence, the time to solve each subproblem (a), (b) or (c) of Theorem 2.1 is O(S²A log S). Since, by Theorem 2.1, 3n such subproblems must be solved, part (c) of Corollary 2.3 follows.

Remark. Hartman and Arguelles [HA99, p. 437, Theorem 17] give an algorithm for finding a maximum-transient-value policy under the hypotheses of (a) of Corollary 2.3 and the assumption that the actions in each state consist of choosing the next state to visit. In this event, there are S actions in each state, so A = S². Under this hypothesis, their method runs in O(S⁴) time. The running time of the new method for this case agrees with theirs, but is faster when A < S².

2.6 Prior Decomposition Methods

The idea of decomposing the problem of finding an n-optimal policy into subproblems related to those proposed here has been explored previously in some special cases. Following is a review of some of this work and its relationship to the approach presented here.

−1-Optimality (Maximum-Reward-Rate) Problem. Denardo [Den70b] shows how to solve the maximum-reward-rate problem in general by solving Manne's [Man60] and de Ghellinck's [dG60] linear program on a sequence of up to 2n "nested" restrictions of the problem.
Here, n is the number of recurrent classes under the maximum-reward-rate policy and "nested" means that each restriction has a state space that is a proper subset of its predecessor's.

Bather [Bat73] finds a maximum-reward-rate policy in two steps. First, he separately considers each recurrent class together with those transient classes that feed directly into that class, finding a maximum-reward-rate policy in the subsystem using a simplified version of Howard's policy iteration. Then he uses a maximum-transient-value problem to decide if it is optimal to stay in a recurrent class (using the policy found within its subsystem), or to move down to another accessible recurrent class. The resulting policy has maximum reward rate for the original system.

0-Optimality Problem. Building on an idea in [Vei66], Denardo [Den70a] shows how to decompose the problem of finding a 0-optimal policy into three subproblems. His first subproblem amounts to finding a maximum-reward-rate policy by policy iteration or linear programming, and so is quite different from subproblem (a) for finding a 0-optimal policy. But his second two subproblems are close to subproblems (b) and (c), the difference being minor if Blackwell's [Bla62] policy iteration is used to solve these subproblems and being greater if linear programming is used.


n-Optimality Problem. Two prior decompositions for transient and irreducible systems and a decomposition outline for the general case currently exist.

• Transient Systems. For transient systems, i.e., where every policy is transient, results of Veinott [Vei69, pp. 1638, 1648–1649] imply that it is possible to decompose the problem of finding an n-optimal policy into a sequence of n+1 subproblems like subproblem (c), except that one can omit the vector v_η^n. This is possible in transient systems because, as he shows, ∆^{n−1} = E_*^{n−1} and θ ∈ E_*^{n−1} is n-optimal if and only if max_{γ∈E^{n−1}} g^n_{γθ} = 0, i.e., v^n = v_θ^n is the unique fixed point of (2.12). This result also follows from Theorem 2.1 on observing that a policy θ that solves subproblem (c) for n also solves subproblems (a) and (b) for n+1.

• Irreducible Systems. For irreducible systems, i.e., where every policy is irreducible, Avrachenkov and Altman [AA98] show how to decompose the problem of finding an n-optimal policy into a sequence of n+2 subproblems like subproblem (b). This result also follows from Theorem 2.1 on observing that a policy η that solves subproblem (b) for n also solves subproblem (c) for n and subproblem (a) for n+1.

2.7 Contributions

This chapter presents a new three-step decomposition for finding strong-present-value optimal policies in substochastic systems. None of the three steps requires a specific methodology, so the entire decomposition is independent of the solution methods used at each step.
If linear programming is used to solve each step of the decomposition, the entire process runs in polynomial time with respect to the size of the dynamic program. Also, this chapter shows how to use two different state-space partitions (recurrent and communicating classes) to solve each step efficiently (again, no particular solution method is specified). Finally, this chapter defines unique transition systems and shows that the decomposition presented here may solve in strongly polynomial time (with respect to the size of the dynamic program) for these systems, with the exact running time depending on the properties of the system (and the specialized methods used to solve each step of the decomposition).

Acknowledgement. The work contained within this chapter is joint work with Prof. Arthur F. Veinott, Jr.


Chapter 3

Numerically Stable Present-Value Expansion

3.1 Introduction

One of the most important steps in implementing any method is ensuring the method remains valid under finite precision. Both Miller and Veinott's [MV69] and Veinott's [Vei69] policy improvement for substochastic systems and Rothblum's [Rot75] extended policy improvement for normalized systems require the computation of Laurent expansion coefficients via a set of linear equations. If these coefficients are not computed to a high degree of accuracy, then these policy improvement methods may be invalidated. This chapter presents a method for solving the linear equations (thus computing the coefficients) for systems with substochastic classes (a subset of normalized systems).

Section 3.2 presents the DP formulation, describes the systems considered in this chapter, and gives the modified Laurent expansion and linear equations for these systems. Then, Section 3.3 shows how to use the classes of a policy to decompose the problem of solving the linear equations for a policy into many problems that solve the linear equations within a substochastic, irreducible class. Section 3.4 shows how to perform numerically stable computations to solve these linear equations within a substochastic, irreducible class. This provides the necessary building block to compute the coefficients for the entire system in a numerically stable way. Finally, the question of numerical stability is addressed in detail in §3.5 and the contributions of this chapter are summarized in §3.6.


CHAPTER 3. NUMERICALLY STABLE METHOD

3.2 Preliminaries

3.2.1 Formulation

Consider a general system observed in periods 1, 2, .... The system exists in a finite set S of S states. In each state s ∈ S the system takes one of a finite set A_s of actions. The number of state-action pairs is given by A. Taking action a ∈ A_s from s ∈ S earns reward r(s,a) and causes a transition to state t ∈ S with rate p(t|s,a).

Each (stationary) policy δ ∈ ∆ is a function that assigns a unique action δ_s ∈ A_s to each s ∈ S, and induces a single-period S-column reward vector r_δ ≡ (r(s, δ_s)) and an S × S transition matrix P_δ ≡ (p(t|s, δ_s)).

3.2.2 Systems with Substochastic Classes

The matrix P_δ defines a Markov chain, complete with recurrent and transient classes under δ. If every class C (under δ) has 0 ≤ p(t|s,a) ≤ 1 and Σ_{t∈C} p(t|s,a) ≤ 1 for s, t ∈ C, a ∈ A_s, then δ has substochastic classes. Also, since the blocks of P_δ corresponding to these classes lie on the diagonal and have spectral radius not exceeding one, P_δ has spectral radius not exceeding one. If every (stationary) policy has substochastic classes, then the system has substochastic classes. Also, since the spectral radius of the transition matrix for every (stationary) policy does not exceed one, the system is normalized. Hereafter, this chapter only considers systems with substochastic classes.

Blackwell's [Bla62] existence theorem holds for normalized systems, so it suffices to restrict attention to stationary policies throughout this chapter. Rothblum [Rot75] shows how to extend the policy improvements from Miller and Veinott [MV69] and Veinott [Vei69] (for substochastic systems) to normalized systems.
The only differences arise in the Laurent expansion and the linear equations needed to find the coefficients of this expansion.

3.2.3 Laurent Expansion

For each policy δ, let the degree of δ be the smallest nonnegative integer d_δ ≡ i such that Q_δ^i and Q_δ^{i+1} have the same null space (where Q_δ ≡ P_δ − I). Let the system degree d ≡ max_{δ∈∆} d_δ. For normalized systems, the Laurent expansion becomes

$$V_\delta^\rho = \sum_{n=-d}^{\infty} \rho^n v_\delta^n. \tag{3.1}$$

If V^{n+d} ≡ (v^{−d}, ..., v^{n+d}) is a solution of

$$r_\delta^j + Q_\delta v^j = v^{j-1}, \quad j = -d, \dots, n+d, \tag{3.2}$$


where r^0(s,a) ≡ r(s,a), r^j(s,a) ≡ 0 for j ≠ 0, r_δ^j ≡ (r^j(s, δ_s)) and v_δ^{−d−1} = 0, then V^n ≡ (v_δ^{−d}, ..., v_δ^n) = (v^{−d}, ..., v^n). The variables v^{n+1}, ..., v^{n+d} are not uniquely determined.

The remainder of this chapter presents a numerically stable method for finding V_δ^n for a given δ ∈ ∆ and −d ≤ n. More generally, the method finds a solution of

$$c^j + Q_\delta v^j = v^{j-1}, \quad j = m+1, \dots, n+d, \tag{3.3}$$

given δ ∈ ∆, v^m and c^j, j = m+1, ..., n+d. Again, only v^j, j = m+1, ..., n are uniquely determined.

3.3 Using the Class Structure

Given a policy δ (in a system with substochastic classes), the transition matrix P_δ not only defines (recurrent and transient) classes, but also induces a dependence partial ordering amongst the classes. In substochastic systems, the only classes that don't depend on any other class are recurrent. All other classes are transient. Veinott solves for the Laurent coefficients by solving the linear equations within each recurrent class separately (accounting for the singularity present in these classes) and using these coefficients to solve for the rest of the system (that is transient). However, in systems with substochastic classes, recurrent classes may depend on other (recurrent or transient) classes.

Let D_0 be the set of classes that are independent, i.e., don't depend on any other classes. Let D_i be the set of classes that depend only on classes in ∪_{j=0}^{i−1} D_j and let S_i ≡ {s ∈ C : C ∈ ∪_{j=0}^{i−1} D_j}. For each class C in D_i, consider (3.2) for s ∈ C.
The dependence partial ordering shows that p(t|s, δ_s) > 0 only for t in C ∪ S_i. If one has already calculated V_t^{n+d} ≡ (V_t^n, v_t^{n+1}, ..., v_t^{n+d}) for t ∈ S_i, then solving (3.2) within C reduces to solving (3.3) within C, with m ≡ −d−1, v_δ^{−d−1} ≡ 0 and c_s^j ≡ r^j(s, δ_s) + Σ_{t∈S_i} p(t|s, δ_s) v_t^j for s ∈ C. By solving (3.3) within the classes in D_i, i = 0, 1, ..., in order, one may solve (3.2) for the entire system efficiently. Also, this is done by solving (3.3) within substochastic, irreducible classes.

3.4 Substochastic, Irreducible Classes

This section presents a method for solving the linear system (3.3) within a single (substochastic, irreducible) class. Throughout, the notation for system parameters denotes those same parameters restricted to the class, e.g., S refers to the states within the class, P_δ refers to the transition matrix restricted to the states in the class, etc.

Each class may be either transient or recurrent. In a transient class, lim_{N→∞} P_δ^N = 0, so Q_δ^{−1} ≡ (P_δ − I)^{−1} = −(I + P_δ + P_δ² + ···) is well-defined (i.e., Q_δ is non-singular). If the class is recurrent, then it must be stochastic, i.e., Σ_{t∈S} p(t|s, δ_s) = 1 for every s ∈ S. Then the rows of Q_δ sum to zero, so Q_δ is singular. Since the class is irreducible, eliminating any (single) state s


from the class causes the remainder to become transient. This is equivalent to removing the row corresponding to the state-action pair (s, δ_s) and the column corresponding to s from P_δ. Removing this row and column from Q_δ causes it to become non-singular. Therefore, Q_δ has rank S − 1, i.e., it is rank-deficient by one.

3.4.1 Rank-revealing LU factors

Given a (substochastic, irreducible) class, one may deduce whether it is transient or recurrent by means of a rank-revealing LU (RRLU) factorization of Q_δ. Moreover, regardless of the nature of the class, the LU factors may still be used to solve (3.3).

The RRLU takes the form

$$T_1 Q_\delta T_2^T \equiv \begin{pmatrix} \hat Q & \hat q \\ q^T & \varphi \end{pmatrix} = \begin{pmatrix} L & 0 \\ l^T & 1 \end{pmatrix} \begin{pmatrix} U & u \\ 0 & \varepsilon \end{pmatrix},$$

where T_1 and T_2 are permutations, Q̂ = LU is an (S−1) × (S−1) non-singular matrix, q^T and l^T are (S−1)-row vectors, q̂ and u are (S−1)-column vectors, and ε is a valuable indicator if the permutations are chosen correctly (according to the Threshold Complete Pivoting strategy described in Chapter 4). If |ε| is suitably large, Q_δ is taken as full rank, but if |ε| = O(ɛ), where ɛ is the machine precision, Q_δ is regarded as singular.
In particular, it is rank-deficient by one.

For simplicity, assume the permutations T_1 and T_2 are the identity and write the LU factorization as

$$Q_\delta \equiv \begin{pmatrix} \hat Q & \hat q \\ q^T & \varphi \end{pmatrix} = \begin{pmatrix} L & 0 \\ l^T & 1 \end{pmatrix} \begin{pmatrix} U & u \\ 0 & \varepsilon \end{pmatrix}.$$

Regardless of the nature of the class, the LU factors may be used to solve (3.3) because the system is consistent even if Q_δ is singular.

3.4.2 Transient classes

If the RRLU factorization shows that Q_δ has full rank, then the linear system (3.3), written equivalently as

$$\begin{pmatrix} Q_\delta & & & \\ -I & Q_\delta & & \\ & \ddots & \ddots & \\ & & -I & Q_\delta \end{pmatrix} \begin{pmatrix} v^{m+1} \\ v^{m+2} \\ \vdots \\ v^{n+d} \end{pmatrix} = \begin{pmatrix} v^m - c^{m+1} \\ -c^{m+2} \\ \vdots \\ -c^{n+d} \end{pmatrix}, \tag{3.4}$$

is nonsingular and block triangular. It may therefore be solved sequentially by block forward substitution:

$$Q_\delta v^j = v^{j-1} - c^j \tag{3.5}$$


in order for j = m+1, ..., n+d (using the full LU factors of Q_δ).

3.4.3 Recurrent classes

If the RRLU factorization shows that Q_δ is rank-deficient by one, then the linear system (3.3) may be represented in matrix form as

$$\begin{pmatrix}
\hat Q & \hat q & & & & \\
q^T & \varphi & & & & \\
-I & & \hat Q & \hat q & & \\
& -1 & q^T & \varphi & & \\
& & \ddots & & \ddots & \\
& & & -I & & \hat Q & \hat q \\
& & & & -1 & q^T & \varphi
\end{pmatrix}
\begin{pmatrix}
\hat v^{m+1} \\ v_S^{m+1} \\ \hat v^{m+2} \\ v_S^{m+2} \\ \vdots \\ \hat v^{n+d} \\ v_S^{n+d}
\end{pmatrix}
=
\begin{pmatrix}
\hat v^{m} - \hat c^{m+1} \\ v_S^{m} - c_S^{m+1} \\ -\hat c^{m+2} \\ -c_S^{m+2} \\ \vdots \\ -\hat c^{n+d} \\ -c_S^{n+d}
\end{pmatrix}, \tag{3.6}$$

where v̂^j and ĉ^j represent the first S − 1 elements of v^j and c^j respectively. Rearranging (3.6), grouping the v̂^j unknowns before the v_S^j unknowns, gives

$$\begin{pmatrix}
\hat Q & & & \hat q & & \\
-I & \ddots & & & \ddots & \\
& -I & \hat Q & & & \hat q \\
q^T & & & \varphi & & \\
& \ddots & & -1 & \ddots & \\
& & q^T & & -1 & \varphi
\end{pmatrix}
\begin{pmatrix}
\hat v^{m+1} \\ \vdots \\ \hat v^{n+d} \\ v_S^{m+1} \\ \vdots \\ v_S^{n+d}
\end{pmatrix}
=
\begin{pmatrix}
\hat v^{m} - \hat c^{m+1} \\ -\hat c^{m+2} \\ \vdots \\ -\hat c^{n+d} \\ v_S^{m} - c_S^{m+1} \\ -c_S^{m+2} \\ \vdots \\ -c_S^{n+d}
\end{pmatrix}.$$

Since Q_δ is rank-deficient by one, the last row of the system

$$Q_\delta v^{m+1} = v^m - c^{m+1},$$

i.e., q^T v̂^{m+1} + φ v_S^{m+1} = v_S^m − c_S^{m+1}, is redundant and may be omitted. Also, since v^{n+d} is not determined uniquely, one may assign v_S^{n+d} ≡ 0 and eliminate the column corresponding to this


variable. The remaining system

$$\begin{pmatrix}
\hat Q & & & & \hat q & & \\
-I & \hat Q & & & & \ddots & \\
& \ddots & \ddots & & & & \hat q \\
& & -I & \hat Q & & & \\
q^T & & & & -1 & \varphi & \\
& \ddots & & & & \ddots & \ddots \\
& & q^T & & & & -1
\end{pmatrix}
\begin{pmatrix}
\hat v^{m+1} \\ \hat v^{m+2} \\ \vdots \\ \hat v^{n+d} \\ v_S^{m+1} \\ \vdots \\ v_S^{n+d-1}
\end{pmatrix}
=
\begin{pmatrix}
\hat v^{m} - \hat c^{m+1} \\ -\hat c^{m+2} \\ \vdots \\ -\hat c^{n+d} \\ -c_S^{m+2} \\ \vdots \\ -c_S^{n+d}
\end{pmatrix} \tag{3.6'}$$

has a unique solution (as it has full rank). Label the various components shown in (3.6') as

$$\begin{pmatrix} A & B \\ C & D \end{pmatrix} \begin{pmatrix} \hat v \\ v_S \end{pmatrix} = \begin{pmatrix} \hat c \\ c_S \end{pmatrix}. \tag{3.7}$$

Since Q̂ is nonsingular (and Q̂ = LU has already been found) it is easy to solve a system Ax = b sequentially. Hence a block LU factorization,

$$\begin{pmatrix} A & B \\ C & D \end{pmatrix} = \begin{pmatrix} A & \\ C & I \end{pmatrix} \begin{pmatrix} I & Y \\ & Z \end{pmatrix},$$

solves the full system efficiently. First, solve AY = B and calculate the Schur complement Z = D − CY; then use block-forward and block-backward substitution, i.e., solve

$$\begin{pmatrix} A & \\ C & I \end{pmatrix} \begin{pmatrix} \hat x \\ x_S \end{pmatrix} = \begin{pmatrix} \hat c \\ c_S \end{pmatrix}
\quad\text{and}\quad
\begin{pmatrix} I & Y \\ & Z \end{pmatrix} \begin{pmatrix} \hat v \\ v_S \end{pmatrix} = \begin{pmatrix} \hat x \\ x_S \end{pmatrix},$$

to solve (3.6'). The remainder of this section breaks down each step of the process.

Solving AY = B. Expanding A and B into matrix form gives

$$\begin{pmatrix} \hat Q & & & \\ -I & \hat Q & & \\ & \ddots & \ddots & \\ & & -I & \hat Q \end{pmatrix} Y = \begin{pmatrix} \hat q & & \\ & \hat q & \\ & & \ddots \\ & & & \hat q \\ & & & \end{pmatrix}$$

(the last block row of the right-hand side is zero, since v_S^{n+d} has been eliminated).


Therefore, the first column of Y solves the following block-bidiagonal system:

$$\begin{pmatrix} \hat Q & & & & \\ -I & \hat Q & & & \\ & \ddots & \ddots & & \\ & & -I & \hat Q & \\ & & & -I & \hat Q \end{pmatrix}
\begin{pmatrix} y^{m+1} \\ y^{m+2} \\ \vdots \\ y^{n+d-1} \\ y^{n+d} \end{pmatrix}
= \begin{pmatrix} \hat q \\ 0 \\ \vdots \\ 0 \\ 0 \end{pmatrix}.$$

One may solve the system sequentially (cf. §3.4.2), i.e., LU y^j = y^{j−1} in order for j = m+1, ..., n+d (where y^m ≡ q̂). The blocks of the first column of Y give all elements of the following Toeplitz matrix:

$$Y = \begin{pmatrix}
y^{m+1} & & & \\
y^{m+2} & y^{m+1} & & \\
\vdots & y^{m+2} & \ddots & \\
y^{n+d-1} & \vdots & \ddots & y^{m+1} \\
y^{n+d} & y^{n+d-1} & \cdots & y^{m+2}
\end{pmatrix}.$$

Forming Z = D − CY. Since C is block-diagonal and Y is block-lower triangular, the product CY has the form

$$CY = \begin{pmatrix}
q^T y^{m+2} & q^T y^{m+1} & & \\
q^T y^{m+3} & q^T y^{m+2} & \ddots & \\
\vdots & \vdots & \ddots & q^T y^{m+1} \\
q^T y^{n+d} & q^T y^{n+d-1} & \cdots & q^T y^{m+2}
\end{pmatrix},$$

giving

$$Z = \begin{pmatrix}
-1 & \varphi & & \\
& -1 & \ddots & \\
& & \ddots & \varphi \\
& & & -1
\end{pmatrix} - CY
= -\begin{pmatrix}
1 + q^T y^{m+2} & q^T y^{m+1} - \varphi & & \\
q^T y^{m+3} & 1 + q^T y^{m+2} & \ddots & \\
\vdots & \ddots & \ddots & q^T y^{m+1} - \varphi \\
q^T y^{n+d} & \cdots & q^T y^{m+3} & 1 + q^T y^{m+2}
\end{pmatrix}.$$


However, Q̂ y^{m+1} = q̂, so q^T y^{m+1} = φ (because Q_δ is rank-deficient by one), so Z is both lower triangular and Toeplitz:

$$Z = -\begin{pmatrix}
1 + q^T y^{m+2} & & \\
q^T y^{m+3} & 1 + q^T y^{m+2} & \\
\vdots & \ddots & \ddots \\
q^T y^{n+d} & \cdots & q^T y^{m+3} & 1 + q^T y^{m+2}
\end{pmatrix}.$$

Block-forward substitution. The solution x̂ of A x̂ = ĉ is found sequentially. Then x_S may be calculated directly from x_S = c_S − C x̂.

Block-backward substitution. If Z is large, the solution of Z v_S = x_S may be found using special methods for (lower) triangular Toeplitz systems [Tre87]. Otherwise (Z is of moderate size), ordinary forward substitution suffices to find v_S. Then, v̂ is obtained directly from v̂ = x̂ − Y v_S.

3.5 Numerical Stability

Regardless of whether a class is transient or recurrent, one will always solve a block-lower triangular system sequentially (block-forward substitution) using a nonsingular matrix Q (≡ Q_δ for transient classes, ≡ Q̂ for recurrent classes). Since the block-forward substitution is equivalent to using Q^{j−m}, j = m+1, ..., n+d, any errors in solving with Q will grow exponentially. If Q is ill-conditioned, one may reduce this effect by keeping the interval n+d−m in (3.3) small. Implementing policy improvement as suggested by Veinott [Vei69], i.e., finding m-optimal policies for m = −d, ..., n sequentially, gives V_δ^{m−1} throughout, and one only searches for m, ..., (m+d)-improvements (again in order). Finding an (m+d)-improvement requires V_δ^{m+d}, so the interval for (3.3) is (at most) m + 2d − (m − 1) = 2d + 1.
Therefore, the smaller the degree of the system, the more reliable the calculations become.

The RRLU is essential in this method. If a class is recurrent, but the LU fails to identify the singularity in Q, then Q will be extremely ill-conditioned and computational errors may become prominent.

For recurrent classes, the method works with the block LU factorization (3.7). The factorization is stable as long as A is not almost singular and the elements of either B or C (or both) are not much larger than the biggest element of A.

The Schur complement Z is lower triangular with the constant diagonal element φ_Z ≡ 1 + q^T y^{m+2}. The condition of Z will be reasonable if φ_Z is not close to zero and not significantly smaller than the off-diagonal elements q^T y^{m+3}, ..., q^T y^{n+d}.
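The identity q^T y^{m+1} = φ, which is what removes the superdiagonal terms q^T y^{m+1} − φ and makes Z lower triangular and Toeplitz, can be checked numerically. The 3-state chain below is an arbitrary illustration, not an example from the text.

```python
# Numerical check of the identity q^T y^{m+1} = phi from section 3.4.3 on an
# arbitrary recurrent (stochastic, irreducible) 3-state class. Any irreducible
# stochastic P works; this particular P is an illustrative assumption.
import numpy as np

P = np.array([[0.0, 0.7, 0.3],
              [0.5, 0.0, 0.5],
              [0.4, 0.6, 0.0]])
Q = P - np.eye(3)                  # Q_delta = P_delta - I, rank S - 1

Qhat, qhat = Q[:2, :2], Q[:2, 2]   # leading (S-1) x (S-1) block, last column
q, phi = Q[2, :2], Q[2, 2]         # last row split as (q^T, phi)

y = np.linalg.solve(Qhat, qhat)    # y^{m+1} solves Qhat y = qhat
# Because [q^T phi] is a combination of the rows [Qhat qhat], q^T y = phi,
# so the superdiagonal entries q^T y^{m+1} - phi of Z vanish.
assert np.isclose(q @ y, phi)
```

The same check works for any size: only the last row's membership in the row space of the first S − 1 rows is used, which is exactly the rank-deficiency-by-one property.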


3.6 Contributions

Given a policy in a system with substochastic classes, this chapter presents a new method for finding Laurent expansion coefficients (for the present value of the policy). A rank-revealing LU factorization indicates whether each (communicating) class is transient or recurrent. For transient classes, the LU factors provide a numerically stable method for finding the coefficients. For recurrent classes, this chapter gives a matrix decomposition based on well-defined submatrices of the LU factors to find the coefficients. Finally, this chapter discusses where numerical instability may arise during computation of the coefficients.

Acknowledgement. The work contained within this chapter is joint work with Prof. Michael A. Saunders.


Chapter 4

On a Rank-Revealing Sparse LU Factorization

4.1 Introduction

This chapter presents a rank-revealing LU (RRLU) factorization for general sparse matrices (which may be square or rectangular). Although dense RRLU factorizations exist [Cha84, HLY92], as do sparse LU factorizations ([DR96, DEG+99] and many others), no previous sparse LU factorizations have focused on rank-revealing properties. The method presented here is implemented as a new option within LUSOL, the sparse LU factorization described by Gill, Murray, Saunders, and Wright [GMSW87]. LUSOL uses a Markowitz pivot strategy to maintain sparsity, along with threshold partial pivoting (TPP) to preserve stability. The new option implements threshold complete pivoting (TCP). While requiring relatively few changes to the code, it provides significant rewards.

In particular, the stricter stability test is guaranteed to move the smaller pivots (if any) to the end of the factorization. This is sufficient to reveal the rank of most matrices except for the best-known pathological example. The TCP factorization also tends to preserve sparsity well. The main challenge is to stay as near as possible to the linear running time that is usually achieved by the conventional TPP factorization.

Section 4.2 describes various forms of LU factorization, including the original TPP factorization and the modifications made to LUSOL to implement the TCP pivoting strategy. Section 4.3 illustrates the rank-revealing property of the new factorization on some small classical test cases.
Next, §4.4 describes experiments on many sparse examples from a dynamic programming application. The complexity appears to be polynomial but less than quadratic in the number of nonzeros in the LU factors. Section 4.5 describes the usefulness of the new factorization within optimization codes. Finally, §4.6 summarizes the contributions within this chapter.


CHAPTER 4. RANK-REVEALING SPARSE LU

4.2 LU Factorization

The LU factors of a matrix A may be written as A = LU = Σ_k l_k u_k^T, where l_k and u_k^T are the columns of L and the rows of U. With A^{(1)} = A, the kth factorization step produces

$$A^{(k+1)} = A^{(k)} - l_k u_k^T,$$

in which some nonzero element A_{ij}^{(k)} is chosen as "pivot", and the associated row i and column j become zero. (Thus, A^{(k+1)} contains k empty rows and columns.) For simplicity k is omitted and each step is written as

$$l = A_{\cdot j}/A_{ij}, \qquad u^T = A_{i \cdot}, \qquad A \leftarrow A - lu^T.$$

The resulting L and U are permuted triangles. In pivot order, the diagonals of L are 1 and the pivots A_{ij} become the diagonals of U.

The process is stable if the elements of A do not grow large at any stage, and the factors tend to be sparse if the rank-one updates lu^T are formed from sparse pivot rows and columns. In practice, one attempts to control element growth by bounding the elements of L. The following terms are needed (where "A" still means A^{(k)}):

FactorTol  A stability tolerance for controlling the magnitude of |L_{ij}|. Pivots are chosen so that 1 ≤ ‖l‖_∞ ≤ FactorTol ≤ 100 (say).

DropTol  A tolerance for ignoring negligible entries. Updated elements are treated as zero if |A_{ij}| < DropTol.
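A dense sketch of one elimination step may clarify the notation. The 2×2 matrix, the column searched, and the FactorTol value are arbitrary; a real sparse code such as LUSOL also applies Markowitz ordering, which is omitted here.

```python
# One elimination step as described above: pick a pivot A[i][j], form
# l = A[:, j] / A[i, j] and u^T = A[i, :], then update A <- A - l u^T.
# The threshold test (|pivot| >= column max / FactorTol) is shown for one
# column only; this is a dense illustrative sketch, not LUSOL's code.
import numpy as np

def eliminate_step(A, i, j):
    l = A[:, j] / A[i, j]        # column of L (entry 1 at the pivot row)
    u = A[i, :].copy()           # row of U
    return A - np.outer(l, u), l, u

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])
FactorTol = 10.0
col = 0
i = int(np.argmax(np.abs(A[:, col])))
# threshold stability test on the chosen pivot
assert abs(A[i, col]) >= np.abs(A[:, col]).max() / FactorTol
A1, l, u = eliminate_step(A, i, col)
# after the update, the pivot row and pivot column are zero
assert np.allclose(A1[i, :], 0.0) and np.allclose(A1[:, col], 0.0)
```

Repeating the step on the nonzero remainder yields the permuted-triangular L and U described in the text.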


4.2.1 Dense Partial and Complete Pivoting

For a dense n×n matrix, LU factorization requires about (1/3)n³ arithmetic operations. Partial pivoting is usually employed with FactorTol = 1 (since sparsity is not an issue). Finding the largest element in each pivot column requires O(n²) comparison operations in total (negligible compared to the arithmetic operations).

Complete pivoting is known to be numerically preferable, as it is even more likely to prevent element growth. Finding the largest element in all remaining rows and columns requires about (1/3)n³ comparisons, almost doubling the cost of dense LU factorization. (Note that in the sparse case, one would be delighted if complete pivoting cost only twice as much as sparse partial pivoting.)

Businger [Bus71] has shown how to avoid most of the extra cost by switching from partial to complete pivoting only in rare circumstances: when the elements of u_k become too large, not counting the pivots. (This involves monitoring the off-diagonal elements of the current rows of U.) For dense matrices only O(n²) comparisons are required. Businger's approach also seems suitable in the sparse case (to ensure stability). It will be considered for future implementation.

4.2.2 Stability and Rank Detection

By definition, a stable method computes LU factors for any matrix, regardless of condition or singularity, with the relation A = LU holding to high accuracy in all cases. If A is ill-conditioned, at least one of L or U must be also. For example, consider n × n matrices of the form

$$B_n = \begin{pmatrix}
0.1 & 1 & & & \\
& 0.1 & 1 & & \\
& & \ddots & \ddots & \\
& & & 0.1 & 1 \\
& & & & 0.1
\end{pmatrix},$$

for which cond(B_n) ≈ 10^n with only one small singular value. For n ≥ 15, B_n is essentially singular and rank(B_n) = n − 1.
Partial pivoting exhibits perfect stability by returning L = I and U = B_n. Complete pivoting cannot be more stable, but by disallowing the diagonal 0.1 pivots, it ultimately returns a U with one small diagonal U_{nn}, clearly indicating the rank.

Note that Matlab's condest [HT00] correctly estimates cond(B_15) ≈ 10^15 and returns a vector v for which B_15 v = O(ɛ). However, the elements of v are smoothly graded: v = [0.9, −0.09, 0.009, ..., −9×10^{−14}, 9×10^{−15}]^T, giving no indication of the source or degree of singularity. For the applications discussed later, the requirements are these:

• An RRLU factorization should be stable in the sense that A = LU holds to high accuracy, and sparse whenever A is sparse.

• The factors should indicate which columns of A could be removed (as few as possible) to improve the condition significantly.
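The B_n example can be reproduced with a small script. The complete-pivoting routine below is a plain dense sketch (no thresholds, no sparsity handling), written only to expose the tiny final pivot that partial pivoting hides.

```python
# Dense Gaussian elimination with complete pivoting on B_n: the first n - 1
# pivots are O(1) (the superdiagonal 1s), while the last pivot is about
# 10^{-n}, revealing the numerical rank. Illustrative sketch only.
import numpy as np

def complete_pivot_magnitudes(A):
    """Return |pivot| at each step of GE with complete pivoting."""
    A = A.copy()
    n = A.shape[0]
    pivots = []
    for k in range(n):
        sub = np.abs(A[k:, k:])
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        A[[k, k + i]] = A[[k + i, k]]            # bring pivot row up
        A[:, [k, k + j]] = A[:, [k + j, k]]      # bring pivot column left
        pivots.append(abs(A[k, k]))
        A[k+1:, k:] -= np.outer(A[k+1:, k] / A[k, k], A[k, k:])
    return pivots

n = 15
B = np.diag([0.1] * n) + np.diag([1.0] * (n - 1), 1)
p = complete_pivot_magnitudes(B)
```

Here p[:-1] are all 1 and p[-1] is roughly 10^{-15}: the product of the pivot magnitudes must equal |det B_n| = 10^{-n}, so disallowing the 0.1 diagonal pivots forces all of the smallness into the final pivot.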


4.2.3 Threshold Partial Pivoting (TPP)

Threshold pivoting methods balance stability and sparsity by using FactorTol > 1. For TPP the Markowitz strategy suggests sparse pivots A_ij and the stability test requires

\[
|A_{ij}| \ge A_j^{\max} / \mathrm{FactorTol}, \tag{4.1}
\]

where A_j^max is the largest element in column j. This is the strategy used by traditional sparse LU packages, including LA05, MA28, MA48, Y12M, LUSOL and MOPS [Rei82, Duf77, DR96, ZWS81, GMSW87, SS90] (some of them oriented by rows rather than columns).

Typically, nnz(L + U) is 1–3 times as large as nnz(A) for the original A (and rarely as much as 10 times). Since A tends to have only 1–10 entries per column and the average Markowitz count M_ij tends to be low, the total storage and work required are both typically O(n). Various implementation strategies have been proposed to achieve this efficiency. In particular:

S1: Zlatev [Zla80] searches only a few of the sparsest rows and columns of the updated A for suitable pivots.

S2: Suhl and Suhl [SS90] store the largest element of each column of the updated A at the beginning of the column's data structure, to facilitate the stability test.

Both strategies are used in the current version of LUSOL. Strategy S2 is especially important for our rank-revealing implementation.
Note that the largest element must be found in each modified column (defined above), but the total searching involved in Strategy S2 tends to be linear in nnz(L + U), as desired.

4.2.4 Threshold Rook Pivoting (TRP)

A stricter stability test would be

\[
|A_{ij}| \ge A_j^{\max} / \mathrm{FactorTol} \quad\text{and}\quad |A_{ij}| \ge A_i^{\max} / \mathrm{FactorTol}, \tag{4.2}
\]

where A_i^max is the largest element in row i. This threshold rook pivoting (TRP) strategy has been implemented by Gupta [Gup00] in the WSMP package for judicious use when the preferred TPP strategy fails. For the dense case with FactorTol = 1, Foster [Fos97] has shown that rook pivoting prevents exponential growth in the size of |A_ij|, and suggests that it has rank-revealing properties. The threshold form therefore warrants future study. In the LUSOL context, TRP would complicate the Markowitz strategy (perhaps requiring the nonzeros to be stored by rows as well as columns), but it may involve less searching overall than the TCP strategy described next.
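As a sketch of how tests (4.1) and (4.2) differ, the following Python fragment expresses them as predicates on a candidate pivot. (This is an illustration with hypothetical names, not code from LUSOL or WSMP; in a real sparse code the tests apply to the partially eliminated submatrix at each stage.)

```python
import numpy as np

def tpp_ok(A, i, j, factor_tol):
    """Test (4.1): the candidate pivot A[i,j] must be within a factor
    FactorTol of the largest element in its column."""
    return abs(A[i, j]) >= np.abs(A[:, j]).max() / factor_tol

def trp_ok(A, i, j, factor_tol):
    """Test (4.2): the same bound must also hold in the pivot's row."""
    return (tpp_ok(A, i, j, factor_tol)
            and abs(A[i, j]) >= np.abs(A[i, :]).max() / factor_tol)

A = np.array([[4.0, 1.0],
              [2.0, 8.0]])
# With FactorTol = 2, the entry A[1,0] = 2 passes the column test (2 >= 4/2)
# but fails the rook test (2 < 8/2): TRP rejects a pivot that TPP accepts.
```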


4.2.5 Threshold Complete Pivoting (TCP)

This is the new option implemented in LUSOL. The stability test is

\[
|A_{ij}| \ge A_{\max} / \mathrm{FactorTol}, \tag{4.3}
\]

where A_max is the largest element in the current A. While the changes in the pivot selection scheme are conceptually minor, they allow the rank of the matrix to be accurately revealed by moving small elements to the later pivot positions. The Markowitz strategy becomes slightly simpler, but may continue longer before an acceptable pivot is found. There are two significant costs:

• Decreased sparsity in L and U compared to TPP.

• The need to maintain A_max at each stage.

The first cost seems inevitable, but Strategy S2 helps with the second. With the largest element of each column known, A_max could be found at stage k by searching n − k elements. However, this approach entails O(n^2) comparisons in total. To economize, after each elimination we find

    A_mod ≡ the largest element in the modified columns

and then consider four cases. Suppose column j_max contains A_max before the elimination.

1. If column j_max was modified (or deleted as the pivot column),
   1a. if A_max ≤ A_mod then A_max ← A_mod;
   1b. else search all columns for a new A_max;
2. else
   2a. if A_max < A_mod then A_max ← A_mod;
   2b. else A_max remains the same.

Only case 1b requires a search of n − k elements. The complexity now depends on how often this case arises, i.e., how often does A_max decrease? For matrices with many nonzeros of equal magnitude, it may not be often. Otherwise, it may be rather often, especially when FactorTol is rather low. For the sparse test matrices of §4.4, the total work appears to be less than O(n^2) but significantly more than O(n).
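The four cases above can be sketched in plain Python as follows. This is an illustration only (the names and the dict representation are hypothetical, not LUSOL's data structures); it assumes, as in Strategy S2, that the per-column maxima are already up to date after the elimination.

```python
def update_amax(a_max, j_max, modified, colmax):
    """One step of the A_max bookkeeping for TCP (cases 1a/1b/2a/2b).

    modified -- columns changed or deleted by the latest elimination
    colmax   -- dict mapping each remaining column to its largest magnitude
    Returns the new (a_max, j_max).  Only case 1b scans every column.
    """
    live = [j for j in modified if j in colmax]          # deleted columns drop out
    j_mod = max(live, key=lambda j: colmax[j], default=None)
    a_mod = colmax[j_mod] if j_mod is not None else 0.0

    if j_max in modified:                  # case 1: the column holding A_max changed
        if a_max <= a_mod:                 # 1a: a modified column supplies the new max
            return a_mod, j_mod
        j_max = max(colmax, key=colmax.get)    # 1b: full search for a new A_max
        return colmax[j_max], j_max
    if a_max < a_mod:                      # 2a: a modified column overtakes A_max
        return a_mod, j_mod
    return a_max, j_max                    # 2b: A_max unchanged
```

Only the branch for case 1b costs O(n − k); the other three reuse the column maxima maintained by Strategy S2.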
However, for those results the implementation avoided a full search only in case 1a. More detailed experiments are needed with the current implementation on a larger range of sparse examples.

4.2.6 Triangular Preprocessing

As in several other codes, LUSOL's Markowitz strategy has the effect of isolating a "backward triangle" U_1 and a "forward triangle" L_1 before any elimination is performed on the remaining


block B. The structure revealed is of the form

\[
\begin{pmatrix}
U_1 & \multicolumn{2}{c}{V} \\
    & L_1 &   \\
    & M   & B
\end{pmatrix}.
\]

With the normal TPP partial pivoting option, LUSOL accepts all rows of U_1 and V as they come, without numerical tests on the pivots. (On the other hand, columns of L_1 and M must satisfy test (4.1) to ensure that L is well-conditioned for LUSOL's updating routines.)

For the TCP option, more care is needed with U_1 to obtain an RRLU factorization. At first sight it seems that all diagonals of U_1 should be subjected to the complete pivoting test (4.3). However, in optimization algorithms it is common for "A" to contain many unit vectors corresponding to slack variables. Ideally they should find their way into U_1, but if A contains a nonzero larger than FactorTol, the TCP test would reject them all. Looking more closely, we see that the top rows have the following structure:

\[
\begin{pmatrix}
\pm I & U_{11} & V_1 \\
      & U_{12} & V_2
\end{pmatrix}.
\]

The matrix ±I denotes columns associated with slack variables (and any other unit vectors in A). When the LU factors are used to solve linear systems, each solve with U will involve a system of the


form

\[
\begin{pmatrix}
\pm I & U_{11} & V_1 \\
      & U_{12} & V_2 \\
      &        & U_2
\end{pmatrix}
\begin{pmatrix} x \\ y \\ z \end{pmatrix}
=
\begin{pmatrix} b \\ c \\ d \end{pmatrix},
\]

where z, y and x are determined in that order by back-substitution. As long as the TCP test is applied to all diagonals after ±I, any singularities there will be detected and (we assume) corrected. Thus, z and y will not be pathologically large and x = ±(b − U_{11} y − V_1 z) will be well defined. (Since the matrices U_{11} and V_1 are original data, they should not contain extremely large numbers.)

4.2.7 The Test for Singularity

After completing a factorization, LUSOL examines the diagonals of U and flags any that are "small" in absolute terms or relative to other elements in the same column of U:

\[
|U_{jj}| \le \mathrm{Utol}_1 \quad\text{or}\quad |U_{jj}| \le \mathrm{Utol}_2 \times \max_i |U_{ij}|. \tag{4.4}
\]

Typical values for Utol_1 and Utol_2 are ε^{2/3} ≈ 3.7 × 10^−11 when the factors are being used to solve linear systems. For rank determination in the Matlab environment, we have used

\[
\mathrm{Utol}_1 = n \|A\| \varepsilon, \qquad \mathrm{Utol}_2 = 0,
\]

with ‖A‖ obtained from norm or normest for dense or sparse A respectively. This matches the tolerance Matlab uses for determining rank via svd.

4.3 Rank-Revealing Tests on Classical Examples

For dense matrices, Chan [Cha84], Hwang, Lin, and Yang [HLY92], and Hwang and Lin [HL97] describe two-pass procedures for detecting one or more singularities.
These RRLU factorizations reduce P_1 A P_2 to

\[
\begin{pmatrix} L & \\ & I \end{pmatrix}
\begin{pmatrix} U_{11} & U_{12} \\ & U_{22} \end{pmatrix},
\]

where P_1 and P_2 are permutations and U_22 is close to zero. The first pass performs an LU factorization using partial pivoting. The second pass uses inverse iteration (involving several solves) to find a rank-revealing permutation. The [HLY92] procedure then performs another (modified partial pivoting) LU factorization that preserves this permutation and stops once U_22 is found (leaving U_22 unfactorized). The work required is given as (2/3)n^3 + 2Jn^2 r + 2nr operations, where r is the rank-deficiency of the n × n matrix and J is the number of inverse iterations used.

Since TCP is a variant of LU factorization (producing complete factors L and U), no second pass or inverse iteration is needed to reveal the rank of A. For dense matrices, the work required is also


about (2/3)n^3 operations.

Both [Cha84] and [HLY92] give examples of matrices where normal LU factorizations will not detect (near) singularities. In this section, the examples are used to compare five different factorizations:

MD   Matlab Dense LU
MS   Matlab Sparse LU
MCS  Matlab Sparse LU with colmmd preprocessing
TPP  Sparse LU with threshold partial pivoting
TCP  Sparse LU with threshold complete pivoting

All Matlab factorizations use partial pivoting. MS and MCS have a parameter thresh that corresponds to 1/FactorTol. (See Matlab help for lu and colmmd in Appendix B.) For these small examples the Matlab codes used thresh = 1.0 (the most reliable value), while TPP and TCP used FactorTol = 1.25 to allow at least a little emphasis on sparsity.

The tests here and later were performed on an SGI Origin 2000 with 195 MHz R10000 CPUs, running Irix 6.5 and Matlab version 6.0.0.88 (R12) with machine precision ε = 2^−52 ≈ 2.2 × 10^−16.

Example 4.1 [Cha84, p. 543]. The matrix T_n ≡ T with

\[
T_{ij} = \begin{cases} 1 & i = j \\ -1 & i < j \\ 0 & i > j \end{cases}
\quad (1 \le i, j \le n), \qquad\text{e.g.}\quad
T_4 = \begin{pmatrix}
1 & -1 & -1 & -1 \\
  &  1 & -1 & -1 \\
  &    &  1 & -1 \\
  &    &    &  1
\end{pmatrix},
\]

is nearly singular for large n. Chan [Cha84] shows that, for a certain permutation of T_n, an LU factorization has the nth pivot equal to 2^−(n−2). Once 2^−(n−2) is less than machine precision, the matrix is effectively singular. Since ε = 2^−52, T_50 has effective rank 49. None of the factorizations tested uncovered the singularity correctly. All returned U with a unit diagonal, except MCS, which gave U_{50,50} = 4.8 × 10^−7.

This is an extremely pathological case in being already triangular with all nonzero elements of equal magnitude.
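The near-singularity of T_n is easy to exhibit directly. In the following NumPy sketch (an illustration, not part of the tests above), a vector x with geometrically growing entries satisfies T_n x = e_n exactly in floating point, so the smallest singular value of T_50 is below 1/‖x‖ < 2^−(n−2), even though every pivot produced by partial pivoting equals 1.

```python
import numpy as np

n = 50
T = np.eye(n) - np.triu(np.ones((n, n)), 1)   # 1 on the diagonal, -1 above, 0 below

# Back-substitution on T x = e_n gives x_n = 1 and x_i = 2^(n-1-i) for i < n.
# Every intermediate value is an integer below 2^53, so T @ x is exact in
# IEEE double precision.
x = np.concatenate([2.0 ** np.arange(n - 2, -1, -1), [1.0]])
e_n = np.eye(n)[:, -1]
# T @ x equals e_n exactly while ||x|| exceeds 2^(n-2) ~ 2.8e14, so T_50 is
# singular to working precision although no LU diagonal is small.
```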
MD and MS find there is nothing to eliminate, returning L = I, U = T. MCS happens to reorder the columns and carry out some elimination, but gives only a small hint of singularity.

Some column re-ordering is certainly required (cf. [Cha84]). However, given T with any ordering of its rows and columns, LUSOL reconstructs the original ordering during its "backward triangle" phase. TPP accepts all +1 pivots without numerical testing, and TCP would not reject them even with FactorTol = 1.0.

Example 4.2 [Cha84, p. 544]. This example actually comes from [Wil65, p. 308 and p. 325]


and demonstrates that partial pivoting may not produce diagonals small enough to reveal the rank successfully. The 21 × 21 tridiagonal matrix

\[
W = \begin{pmatrix}
10 & 1      &        &        &        &    &     \\
1  & 9      & 1      &        &        &    &     \\
   & \ddots & \ddots & \ddots &        &    &     \\
   &        & 1      & 0      & 1      &    &     \\
   &        &        & \ddots & \ddots & \ddots &  \\
   &        &        &        & 1      & -9 & 1   \\
   &        &        &        &        & 1  & -10
\end{pmatrix}
\]

has one singularity (i.e., exact rank 20). MD and MS both give |U_{21,21}| = 8.2 × 10^−11, just missing the indication of singularity. MCS successfully reveals the singularity (one zero pivot), as do TPP and TCP.

Example 4.3 [HLY92, p. 133]. Starting with the matrix

\[
A = \begin{pmatrix} T_{40} & \\ & T_{40} \end{pmatrix}
\]

and making the following changes:

    for i = 1:40,
        A(1:40, 81-i) = A(1:40, 81-i) + A(1:40, i)
        A(i, 41:80) = A(i, 41:80) + A(81-i, 41:80)
    end

gives a matrix

\[
A = \begin{pmatrix} T_{40} & C \\ & T_{40} \end{pmatrix},
\]

where every entry of the 40 × 40 block C is −1 except for 2's on its anti-diagonal. The three smallest singular values are [0.62, 1.9 × 10^−12, 1.9 × 10^−12], and Hwang, Lin, and Yang [HLY92] find

\[
U_{22} = \begin{pmatrix} 1.0 \times 10^{-15} & 3.6 \times 10^{-12} \\ -3.6 \times 10^{-12} & 1.3 \times 10^{-23} \end{pmatrix}.
\]

MD and MS both give U = A (failure). With MCS, the two smallest diagonals of U are [−0.5, 1.5 × 10^−11]. TPP also produces U = A, but TCP gives the last three pivots as [−0.67, −7.3 × 10^−12, −3.6 × 10^−12] (success).

Example 4.4 [HLY92, pp. 133–134]. Starting with the matrix

\[
A = \begin{pmatrix} T_{30} & & \\ & T_{30} & \\ & & T_{30} \end{pmatrix}
\]

and making the following changes:


    for i = 1:30,
        A(1:90, 61-i) = A(1:90, 61-i) + A(1:90, i)
        A(i, 1:90) = A(i, 1:90) + A(61-i, 1:90)
    end
    for i = 1:30,
        A(1:90, 91-i) = A(1:90, 91-i) + A(1:90, 30+i)
        A(30+i, 1:90) = A(30+i, 1:90) + A(91-i, 1:90)
    end

gives a matrix

\[
A = \begin{pmatrix} T_{30} & C & D \\ & T_{30} & C \\ & & T_{30} \end{pmatrix}
\]

with three near-singularities; every entry of the 30 × 30 block C is −1 except for 2's on its anti-diagonal, and every entry of D is −1 except for 2's on its main diagonal. The four smallest singular values are [0.40, 2.8 × 10^−9, 1.4 × 10^−9, 1.4 × 10^−9], and Hwang, Lin, and Yang [HLY92] find U_22 with values ranging in magnitude from 3.7 × 10^−9 to 5.9 × 10^−26. MD and MS both return U = A. For MCS the last three pivots (the smallest) were [0.33, 3.3 × 10^−4, −1.5 × 10^−5]. TPP also produces U = A, but TCP gives the last four pivots as [−0.67, −7.4 × 10^−9, 3.7 × 10^−9, −3.7 × 10^−9].

Example 4.5 [HLY92, pp. 134–136]. Starting with the matrix

\[
A = \begin{pmatrix} W & W \\ & W \end{pmatrix}
\]

and making the following changes:

    for i = 1:21,
        A(1:21, 43-i) = A(1:21, 43-i) + A(1:21, i)
        A(i, 22:42) = A(i, 22:42) + A(43-i, 22:42)
    end


gives a matrix

\[
A = \begin{pmatrix} W & W + E \\ & W \end{pmatrix}
\]

with two singularities (i.e., exact rank 40), where E_{ij} = 2 if i + j = 21 or i + j = 23, and E_{ij} = 0 otherwise. Hwang, Lin, and Yang [HLY92] find U_22 with values between 10^−16 and 10^−17. MD and MS both give 8.2 × 10^−11 for the last pivot. MCS successfully reveals the singularity (two zero pivots), as do TPP and TCP.

The above examples demonstrate that, of all the factorizations tested, only TCP is likely to be rank-revealing (for all but the most pathological example), as long as FactorTol is suitably low. For triangular matrices A or permuted triangles, a necessary condition is 1 ≤ FactorTol < max |A_ij|, which is possible for matrices with at least some variation in the magnitude of the nonzero elements.

4.4 Dynamic Programming Application

The sparse TCP factorization was developed to enhance stability within the computation of Laurent expansion coefficients (see Chapter 3). Dr. Richard Grinold (Barclays Global Investors, San Francisco) provided a large, sparse, substochastic dynamic programming problem, and solving this example by policy improvement requires the Laurent expansion coefficients for many different policies. The method from Chapter 3 finds these coefficients by solving multiple linear systems of varying size, each involving a sparse matrix Q with at most one singularity. Correct rank detection of each Q is crucial for the method.
This motivated the sparse RRLU factorization presented here.

Extracting a subset of the Q matrices from the Barclays example provided a test-bed. Each Q (n × n, Q ≡ P − I for P substochastic) has the following special properties:

• Q has at most one singularity;

• 0 ≤ q_ij ≤ 1 for i ≠ j;

• q_ii = −∑_{j≠i} q_ij;


for 1 ≤ i, j ≤ n. The same five factorizations used in §4.3 were applied to these matrices, but different values for thresh, FactorTol and DropTol were used. The results confirmed that TCP is at least competitive with all the other factorizations for the measures taken.

[Figure 4.1: Comparing the condition of the factors for MS and MCS — condest(L) · condest(U) on a log scale against matrix dimension, for thresh = 0.2 and 1.0]

Comparisons between the factorizations were made using four different measures:

Rank-revealing property  Does the diagonal of the U factor reveal the rank of the matrix? As in §4.3, the rank estimate is the number of diagonal elements of U exceeding tol ≡ n × ‖Q‖ × ε (see §4.2.7).

Condition of the factors  Numerically stable policy improvement (see Chapter 3) solves systems of linear equations using only the nonsingular submatrices of L and U. If Q is nonsingular, these are the factors L and U in their entirety. If Q has a singularity, then these submatrices are the factors L and U with the last row and column removed. If a factorization is numerically stable, then the condition of the nonsingular submatrices should be (relatively) low. The condition of L multiplied by the condition of U gives a single measure of the condition of the


factors. Since L and U are sparse, this quantity is estimated by condest(L) * condest(U).¹

¹ See Matlab help for condest in Appendix B.

[Figure 4.2: Comparing the sparsity of the factors for MS and MCS — nonzeros(L) + nonzeros(U) against matrix dimension, for thresh = 0.2 and 1.0]

Sparsity of factors  Good sparse LU factorizations keep the number of nonzeros in the factors to a minimum. The total number of nonzeros in the factors (i.e., the sum of the nonzeros in L and U) gives a measure of the sparsity of the factors.

Time per factorization  If a factorization is to be used repeatedly on different matrices (as is the case in many applications of sparse LU factorizations, including the numerically stable policy improvement from Chapter 3), then the time for a single factorization needs to be small. Timing the CPU during each factorization gives this measure.

Figures 4.1–4.3 show a comparison of MS and MCS for two values of thresh (0.2 and 1). Notice that for thresh = 1 the condition of the factors decreases (representing greater numerical stability). In fact, with thresh = 0.2, MS fails to reveal the rank for five of the test matrices. The condition


of the factors for these five cases were so high that they are omitted from Figure 4.1 (to provide a clearer comparison of this measure when the factorizations are rank-revealing).

[Figure 4.3: Comparing time per factorization for MS and MCS — CPU seconds against matrix dimension, for thresh = 0.2 and 1.0]

For both Figure 4.2 and Figure 4.3, different values of thresh give the same result for MCS. However, different values of thresh affect MS in the expected way (i.e., higher values for thresh improve the condition of the factors, decrease the sparsity of the factors, and slow the factorization down). All three figures show that MCS works considerably better as a sparse rank-revealing factorization for the given test matrices. They also show that using thresh = 1 improves the condition of the factors without significantly decreasing sparsity or slowing the factorization down. Hereafter, MCS refers to Matlab Sparse LU with colmmd preprocessing and thresh = 1.

Of the other factorizations tested, TPP and TCP were tested using FactorTol = 5 and 10. These factorizations, along with MD, all successfully revealed the rank of every test matrix. It may be that the special structure of the test matrices allows partial pivoting to become rank-revealing.

Figure 4.4 shows that TPP performs the worst numerically (for FactorTol = 5 and 10), but TCP with FactorTol = 5 and MCS both perform almost as well as MD using this measure. For the


sparsity of factors and time per factorization measures, MD is omitted since it performs so poorly as to obscure differences between the other factorizations.

[Figure 4.4: Comparing the condition of the factors — condest(L) · condest(U) against matrix dimension for MD, MCS (thresh = 1.0), TPP and TCP (FactorTol = 5.0 and 10.0)]

When comparing the factorizations using the sparsity of factors measure (see Figure 4.5), the performance order is reversed. TPP performs the best (with either FactorTol), then TCP, and lastly MCS. Even with FactorTol = 5, TCP shows a considerable improvement over MCS.

Next, Figure 4.6 compares the time per factorization and shows the one weakness of TCP. Even with FactorTol = 10 (allowing more flexibility in choosing diagonals, and therefore faster execution), the time per factorization appears to grow faster for TCP than for either MCS or TPP. It appears that both TCP and MCS take polynomial time with respect to the number of nonzeros in the original matrix, whereas TPP appears to grow linearly.

A further test helped discern the merits of the different sparse LU factorizations. The largest test matrix was used (3699 × 3699, with one singularity). With FactorTol ranging from 1 to 50 (and thresh = 1/FactorTol), it became clear how MCS, TPP and TCP could be changed to improve performance in any of the measures used.
All successfully reveal the rank of the matrix, but it


appears that finding the rank of these particular test matrices is not difficult.

[Figure 4.5: Comparing the sparsity of the factors — nonzeros(L) + nonzeros(U) against matrix dimension for MCS (thresh = 1.0), TPP and TCP (FactorTol = 5.0 and 10.0)]

Figures 4.7–4.9 show that for MCS only the condition of the factors is affected by changes in thresh. Setting thresh = 1 appears to give the optimal performance for MCS. For both TPP and TCP, the condition of the factors improves as FactorTol drops towards 1, but the sparsity of the factors and the time per factorization both deteriorate. It appears that FactorTol = 10.0 would be a reliable yet efficient value.

Finally, Figure 4.10 compares the running times of MCS, TPP, and TCP. It also shows O(p) and O(p^2) times (the lower line and upper line, respectively), where p = nnz(A), the number of nonzeros in A. The running time for TPP appears to be O(p) (very satisfactory). TCP appears to be competitive with MCS (with more variability), but both factorizations display times that are O(p^k), 1 < k < 2. The cost of maintaining A_max is evident. On the other hand, since p ≪ m × n if catastrophic fill-in doesn't occur, these factorizations will generally perform better than the O(n^3) time for dense LU factorizations.

Figure 4.11 changes the horizontal axis of Figure 4.10 from the number of nonzeros in A to the


sparsity of the factors (total nonzeros in L and U). Only the TCP factorizations (with FactorTol = 5.0 and 10.0) are shown, along with O(p) and O(p^2) times. Now p = nnz(L + U) represents the density of the factors. Ideally the running time should be O(p), but it appears that TCP is again O(p^k) for 1 < k < 2. TCP also runs in polynomial time with respect to the sparsity of its factors, but the degree of the polynomial is less than 2.

[Figure 4.6: Comparing the time per factorization — CPU seconds against matrix dimension for MCS (thresh = 1.0), TPP and TCP (FactorTol = 5.0 and 10.0)]

4.5 Optimization Application

One major application of sparse LU factorizations is within large-scale optimization systems. In particular, LUSOL is used within MINOS and SNOPT [MS83, GMS97] to implement the simplex method and various quadratic programming and nonlinear programming algorithms. In this context, if a basis matrix appears to be singular, certain columns are replaced by unit vectors (corresponding to slack variables) and the factorization is repeated. LUSOL signals singularity or ill-conditioning by flagging small diagonals of U as described in §4.2.7. The new TCP option has proved invaluable


for ensuring reliability in these tests. Preliminary experience with SNOPT is described in [GMS02].

[Figure 4.7: Comparing the condition of the factors as FactorTol changes — condest(L) · condest(U) for MCS (thresh = 1/FactorTol), TPP and TCP on the largest test matrix]

In particular, the first basis matrix is often ill-conditioned, as it is chosen heuristically by the optimizer's Crash procedure or input by the user (with unpredictable properties). In one case with n = 10000, the usual TPP factorization with FactorTol = 4.0 indicated 243 singularities. After slacks were inserted, the next factorization indicated 47 additional singularities, the next a further 25, then 18, 14, 10, and so on. Nearly 30 TPP factorizations and 460 new slacks were required before the basis was regarded as suitably nonsingular. Since L and U each had about a million nonzeros in all factorizations, the repeated failures were rather expensive. In contrast, a single TCP factorization with FactorTol = 2.5 indicated 100 singularities, after which the modified B proved to be very well conditioned. Although L and U were more dense (1.35 million nonzeros each) and much more expensive to compute, the subsequent optimization required significantly fewer major and minor iterations.

In general, TCP is the primary recourse when unexpected growth occurs in ‖x‖ following solution of Ax = b. (If necessary, FactorTol is gradually reduced towards 1.1 until x appears to be accurate.)
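The flagging step of §4.2.7 that drives this slack-replacement cycle can be sketched as follows (an illustrative Python fragment with hypothetical names; the real test lives inside LUSOL):

```python
def flag_small_diagonals(u_diag, u_colmax, utol1, utol2):
    """Apply test (4.4): flag each U[j,j] that is small in absolute terms
    or relative to the largest element in column j of U.

    u_diag   -- the diagonal entries U[j,j]
    u_colmax -- max_i |U[i,j]| for each column j
    Returns the list of flagged column indices."""
    return [j for j, (ujj, cmax) in enumerate(zip(u_diag, u_colmax))
            if abs(ujj) <= utol1 or abs(ujj) <= utol2 * cmax]
```

In the optimization setting described above, the basis columns flagged this way are the ones replaced by slack unit vectors before refactorizing.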


[Figure 4.8: Comparing the sparsity of the factors as FactorTol changes — nonzeros(L) + nonzeros(U) for MCS (thresh = 1/FactorTol), TPP and TCP]

The new option has proved valuable for some optimization problems arising from partial differential equations. A regular "marching pattern" is sometimes present in A, particularly in the first basis from Crash (as mentioned). With threshold partial pivoting the factors display no small diagonals in U, yet the TCP factors reveal a large number of dependent columns.

4.6 Contributions

This chapter presents a new pivoting strategy for sparse LU factorizations, threshold complete pivoting (TCP). As in the dense case, this pivoting strategy is likely to be more stable than the traditional threshold partial pivoting, and most importantly it appears to be rank-revealing for all but the most pathological matrices. The rank-revealing property is discussed and compared for some classical examples as well as the DP application that motivated developing TCP. Finally, this chapter describes the benefits of TCP within classical optimization algorithms.


[Figure 4.9: Comparing the time per factorization as FactorTol changes — CPU seconds for MCS (thresh = 1/FactorTol), TPP and TCP]

Acknowledgement  The work contained within this chapter is joint work with Prof. Michael A. Saunders.


[Figure 4.10: Time per factorization as a function of the nonzeros in A — CPU seconds for MCS (thresh = 1.0), TPP and TCP (FactorTol = 5.0 and 10.0), with O(p) and O(p^2) reference lines]


[Figure 4.11: Time per factorization as a function of the sparsity of the factors — CPU seconds for TCP (FactorTol = 5.0 and 10.0), with O(p) and O(p^2) reference lines]


Appendix A

Counterexample to Method for Finding 1-Optimal Policies

Denardo [Den71, p. 491] sketches two methods that he states will find a 1-optimal policy. One method uses policy improvement and the other uses linear programming together with some auxiliary routines. Both methods can fail to find a 1-optimal policy, as the following example shows.

Example A.1 Consider the deterministic system in Figure A.1.

Figure A.1: Counterexample for Denardo's Method. [A two-state deterministic system with actions a and b available in each state.]

A policy is an ordered pair of actions for the states 1 and 2. For example, ba is the policy that takes action b in state 1 and action a in state 2. In this example, all policies are 0-optimal, but only α ≡ aa is 1-optimal.

Starting with δ ≡ bb and implementing either of Denardo's methods shows first that δ is 0-optimal [Den70a]. Next Denardo finds a column S-vector v^1 that satisfies Q_δ v^1 = v^0_δ. One such vector is v^1 = (1, −1, 0, 2)^T. He then defines the set E^1 of policies γ for which Q_γ v^1 = v^0_δ and solves a maximum-reward-rate problem over E^1. Since E^1 is the singleton set {δ}, δ is the unique solution of that problem, whether solved by policy improvement or by linear programming. The final step of each of Denardo's methods finds a policy γ that is transient on the states where δ is transient, viz., states 1 and 2. Consequently, γ cannot be α, because states 1 and 2 are recurrent under α. Thus Denardo's methods both fail to find a 1-optimal policy.


Appendix B

Matlab Help Files

LU     LU factorization.
       [L,U] = LU(X) stores an upper triangular matrix in U and a
       "psychologically lower triangular matrix" (i.e. a product of
       lower triangular and permutation matrices) in L, so that
       X = L*U. X must be square.

       [L,U,P] = LU(X) returns lower triangular matrix L, upper
       triangular matrix U, and permutation matrix P so that P*X = L*U.

       LU(X), with one output argument, returns the output from
       LAPACK's DGETRF or ZGETRF routine.

       LU(X,THRESH) controls pivoting in sparse matrices, where THRESH
       is a pivot threshold in [0,1]. Pivoting occurs when the diagonal
       entry in a column has magnitude less than THRESH times the
       magnitude of any sub-diagonal entry in that column.
       THRESH = 0 forces diagonal pivoting. THRESH = 1 is the default.

       See also LUINC, QR, RREF.

COLMMD Column minimum degree permutation.
       P = COLMMD(S) returns the column minimum degree permutation
       vector for the sparse matrix S. For a non-symmetric matrix S,
       S(:,P) tends to have sparser LU factors than S.

       See also SYMMMD, SYMRCM, COLPERM.
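SciPy's SuperLU interface exposes a pivot-threshold knob that plays the same role as MATLAB's LU(X,THRESH), though the exact threshold semantics differ slightly between the two packages. The sketch below is my own illustration, not part of this thesis: it factors a small sparse matrix at the two extreme settings, from full threshold partial pivoting down to diagonal-preferring pivoting.

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu

# diag_pivot_thresh in [0,1] trades stability for sparsity: the
# factorization may keep a diagonal pivot whose magnitude is at least
# thresh times the largest candidate in its column.
A = csc_matrix(np.array([[4.0, 1.0, 0.0],
                         [1.0, 3.0, 1.0],
                         [0.0, 1.0, 2.0]]))
b = np.array([1.0, 2.0, 3.0])

lu_partial  = splu(A, diag_pivot_thresh=1.0)  # threshold partial pivoting
lu_diagonal = splu(A, diag_pivot_thresh=0.0)  # prefer diagonal pivots
x = lu_partial.solve(b)
print(np.allclose(A @ x, b))  # True
```

Smaller thresholds keep more of the original sparsity pattern at some cost in numerical stability, which is the same trade-off FactorTol controls in the experiments of Chapter 4.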


CONDEST 1-norm condition number estimate.
       C = CONDEST(A) computes a lower bound C for the 1-norm
       condition number of a square matrix A.

       C = CONDEST(A,T) changes T, a positive integer parameter equal
       to the number of columns in an underlying iteration matrix.
       Increasing the number of columns usually gives a better condition
       estimate but increases the cost. The default is T = 2, which
       almost always gives an estimate correct to within a factor 2.

       [C,V] = CONDEST(A) also computes a vector V which is an
       approximate null vector if C is large. V satisfies
       NORM(A*V,1) = NORM(A,1)*NORM(V,1)/C.

       Note: CONDEST invokes RAND. If repeatable results are required
       then invoke RAND('STATE',J), for some J, before calling this
       function.

       Uses the block 1-norm power method of Higham and Tisseur.

       See also NORMEST1, COND, NORM.
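The algorithm underlying CONDEST, the Higham-Tisseur block 1-norm estimator [HT00], is also available in SciPy as onenormest. The following sketch is my own construction, not code from this thesis: it assembles a CONDEST-style estimate by computing the 1-norm of A directly and estimating the 1-norm of inv(A) by applying the estimator to inv(A) as a linear operator (one LU factorization plus a few solves; inv(A) is never formed).

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu, onenormest, LinearOperator

# A nearly singular matrix: the exact 3x3 with rows 1..9 is singular,
# and the 1e-8 perturbation leaves a huge 1-norm condition number.
A = csc_matrix(np.array([[1.0, 2.0, 3.0],
                         [4.0, 5.0, 6.0],
                         [7.0, 8.0, 9.0 + 1e-8]]))
lu = splu(A)
n = A.shape[0]
Ainv = LinearOperator((n, n),
                      matvec=lu.solve,
                      rmatvec=lambda y: lu.solve(y, trans='T'),
                      dtype=np.float64)
cond_est = np.linalg.norm(A.toarray(), 1) * onenormest(Ainv)
print(cond_est)  # much greater than 1: A is nearly singular
```

Like CONDEST, this produces a lower bound on the true 1-norm condition number, which in practice is almost always within a small factor of the exact value.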


Bibliography

[AA98] K. E. Avrachenkov and E. Altman, Sensitive discount optimality via nested linear programs for ergodic Markov decision processes, unpublished manuscript, 1998.

[AS80] B. Aspvall and Y. Shiloach, A polynomial time algorithm for solving systems of linear inequalities with two variables per inequality, SIAM J. Comput. 9 (1980), no. 4, 827-845.

[Bal61] M. Balinski, On Solving Discrete Stochastic Decision Problems, Study 2, Mathematica, Princeton, New Jersey, 1961.

[Bat73] J. A. Bather, Optimal decision procedures for finite Markov chains. Part III: General convex systems, Adv. Appl. Prob. 5 (1973), 541-558.

[Bel58] R. A. Bellman, On a routing problem, Quart. Appl. Math. 16 (1958), 87-90.

[Bla62] D. Blackwell, Discrete dynamic programming, Ann. Math. Statist. 33 (1962), no. 2, 719-726.

[Bus71] P. A. Businger, Monitoring the numerical stability of Gaussian elimination, Numer. Math. 16 (1971), 360-361.

[Cha84] T. F. Chan, On the existence and computation of LU-factorizations with small pivots, Math. Comp. 42 (1984), no. 166, 535-547.

[d'E60] F. d'Epenoux, Sur un problème de production et de stockage dans l'aléatoire, Revue Française de Recherche Opérationnelle 4 (1960), no. 14, 3-13. An English translation appears as: A Probabilistic Production and Inventory Problem, Management Sci. 10, 1, 98-108.

[DEG+99] J. W. Demmel, S. C. Eisenstat, J. R. Gilbert, Xiaoye S. Li, and J. W. H. Liu, A supernodal approach to sparse partial pivoting, SIAM J. Matrix Anal. Appl. 20 (1999), no. 3, 720-755.

[Den70a] E. V. Denardo, Computing a bias-optimal policy in a discrete-time Markov decision problem, Oper. Res. 18 (1970), 279-289.


[Den70b] E. V. Denardo, On linear programming in a Markov decision problem, Management Sci. 16 (1970), no. 5, 281-288.

[Den71] E. V. Denardo, Markov renewal programs with small interest rates, Ann. Math. Statist. 42 (1971), no. 2, 477-496.

[Der62] C. Derman, On sequential decisions and Markov chains, Management Sci. 9 (1962), no. 1, 16-24.

[DF68] E. V. Denardo and B. L. Fox, Multichain Markov renewal programs, SIAM J. Appl. Math. 16 (1968), no. 3, 468-487.

[dG60] G. de Ghellinck, Les problèmes de décisions séquentielles, Cahiers du Centre d'Etudes de Recherche Opérationnelle 2 (1960), no. 2, 161-179.

[DR96] I. S. Duff and J. K. Reid, The design of MA48: A code for the direct solution of sparse unsymmetric linear systems of equations, ACM Trans. Math. Software 22 (1996), no. 2, 187-226.

[Duf77] I. S. Duff, MA28—a set of Fortran subroutines for sparse unsymmetric linear equations, Report AERE R8730, Atomic Energy Research Establishment, Harwell, England, 1977.

[EV75] B. C. Eaves and A. F. Veinott, Jr., Policy improvement in positive, negative and stopping Markov decision chains, unpublished manuscript, 1975.

[For56] L. R. Ford, Network flow theory, Tech. Report P-923, The RAND Corporation, Santa Monica, CA, 1956.

[Fos97] L. V. Foster, The growth factor and efficiency of Gaussian elimination with rook pivoting, J. Comput. Appl. Math. 86 (1997), 177-194.

[GMS97] P. E. Gill, W. Murray, and M. A. Saunders, User's guide for SNOPT 5.3: A Fortran package for large-scale nonlinear programming, Numerical Analysis Report 97-5, Department of Mathematics, University of California, San Diego, La Jolla, CA, 1997. Revised May 1998.

[GMS02] P. E. Gill, W. Murray, and M. A. Saunders, SNOPT: An SQP algorithm for large-scale constrained optimization, SIAM J. Optim. (2002).

[GMSW87] P. E. Gill, W. Murray, M. A. Saunders, and M. H. Wright, Maintaining LU factors of a general sparse matrix, Linear Algebra Appl. 88/89 (1987), 239-270.

[Gup00] A. Gupta, WSMP: Watson Sparse Matrix Package, Part II—direct solution of general sparse systems, IBM Research Report RC 21888 (98472), T. J. Watson Research Center, Yorktown Heights, NY, 2000. http://www.cs.umn.edu/~agupta/wsmp.html


[HA99] M. Hartman and C. Arguelles, Transience bounds for long walks, Math. Oper. Res. 24 (1999), 414-439.

[HK79] A. Hordijk and L. C. M. Kallenberg, Linear programming and Markov decision chains, Management Sci. 25 (1979), no. 4, 352-362.

[HL97] T. Hwang and W. Lin, Improved bound for rank-revealing LU factorizations, Linear Algebra Appl. 261 (1997), 173-186.

[HLY92] T. Hwang, W. Lin, and E. K. Yang, Rank-revealing LU factorizations, Linear Algebra Appl. 175 (1992), 115-141.

[HN94] D. S. Hochbaum and J. Naor, Simple and fast algorithms for linear and integer programs with two variables per inequality, SIAM J. Comput. 23 (1994), no. 6, 1179-1192.

[How60] R. A. Howard, Dynamic Programming and Markov Processes, Technology Press-Wiley, Cambridge, Massachusetts, 1960.

[HP91] M. Haviv and M. L. Puterman, An improved algorithm for solving communicating average reward Markov decision processes, Ann. Oper. Res. 28 (1991), 229-242.

[HT00] N. J. Higham and F. Tisseur, A block algorithm for matrix 1-norm estimation, with an application to 1-norm pseudospectra, SIAM J. Matrix Anal. Appl. 21 (2000), no. 4, 1185-1201.

[Kar78] R. M. Karp, A characterization of the minimum cycle mean in a digraph, Discrete Math. 23 (1978), 309-311.

[Kha79] L. G. Khachian, A polynomial algorithm for linear programming, Doklady Akad. Nauk SSSR 244 (1979), no. 5, 1093-1096. English translation in Soviet Math. Doklady 20, 191-194.

[Man60] A. S. Manne, Linear programming and sequential decisions, Management Sci. 6 (1960), no. 3, 259-267.

[Mar57] H. M. Markowitz, The elimination form of inverse and its applications to linear programming, Management Sci. 3 (1957), 255-269.

[MS83] B. A. Murtagh and M. A. Saunders, MINOS 5.0 User's Guide, Report SOL 83-20, Department of Operations Research, Stanford University, Stanford, CA, 1983.

[MV69] B. L. Miller and A. F. Veinott, Jr., Discrete dynamic programming with a small interest rate, Ann. Math. Statist. 40 (1969), no. 2, 366-370.


[Rei82] J. K. Reid, A sparsity-exploiting variant of the Bartels-Golub decomposition for linear programming bases, Math. Prog. 24 (1982), 55-69.

[Rot75] U. G. Rothblum, Normalized Markov decision chains I: Sensitive discount optimality, Oper. Res. 23 (1975), no. 4, 785-795.

[RV91] K. W. Ross and R. Varadarajan, Multichain Markov decision processes with a sample path constraint: A decomposition approach, Math. Oper. Res. 16 (1991), no. 1, 195-207.

[SS90] U. H. Suhl and L. M. Suhl, Computing sparse LU factorizations for large-scale linear programming bases, ORSA J. Comput. 2 (1990), 325-335.

[Tar72] R. E. Tarjan, Depth-first search and linear graph algorithms, SIAM J. Comput. 1 (1972), 146-160.

[Tre87] W. F. Trench, A note on solving nearly triangular Toeplitz systems, Linear Algebra Appl. 93 (1987), 57-65.

[Vei66] A. F. Veinott, Jr., On finding optimal policies in discrete dynamic programming with no discounting, Ann. Math. Statist. 37 (1966), no. 5, 1284-1294.

[Vei68] A. F. Veinott, Jr., Discrete dynamic programming with sensitive optimality criteria, Ann. Math. Statist. 39 (1968), 1372. Preliminary report.

[Vei69] A. F. Veinott, Jr., Discrete dynamic programming with sensitive discount optimality criteria, Ann. Math. Statist. 40 (1969), no. 5, 1635-1660.

[Vei74] A. F. Veinott, Jr., Markov decision chains, Studies in Optimization, MAA Studies in Mathematics (B. C. Eaves and G. B. Dantzig, eds.), vol. 10, Mathematical Association of America, 1974, pp. 124-159.

[Wil65] J. H. Wilkinson, The Algebraic Eigenvalue Problem, Oxford University Press, London, 1965.

[Zla80] Z. Zlatev, On some pivotal strategies in Gaussian elimination by sparse technique, SIAM J. Numer. Anal. 17 (1980), no. 4, 18-30.

[ZWS81] Z. Zlatev, J. Wasniewski, and K. Schaumburg, Y12M: Solution of Large and Sparse Systems of Linear Algebraic Equations, Number 121 in Lecture Notes in Computer Science, Springer-Verlag, Berlin, 1981.
