
Hierarchical Model-Based Reinforcement Learning: R-MAX + MAXQ

Nicholas K. Jong and Peter Stone
Department of Computer Sciences, University of Texas at Austin
International Conference on Machine Learning, 2008


Outline

1. Learning with Hierarchies of Models
   - Learning in Structured Environments
   - MAXQ Decomposition
2. The R-MAXQ Algorithm
   - R-MAX Exploration
   - Results




Introduction

Problem: learn behaviors in unknown environments.
Criterion: minimize the number of suboptimal actions taken.

Idea 1: Model-Based Reinforcement Learning
- Probabilistic finite-time convergence
- Efficient use of sample data
- Robust exploration using model uncertainty

Idea 2: Hierarchical Reinforcement Learning
- Intuitive approach to scaling to large problems
- Decomposition of tasks into subtasks

Our contribution: an integration of model-based and hierarchical RL for fully stochastic, finite problems.


The Taxi Domain

State variables:
- x coordinate
- y coordinate
- Passenger location (at 1 of 4 landmarks or in the taxi)
- Destination location (at 1 of 4 landmarks)

Actions: North, South, East, West, PickUp, PutDown

(Figure: the taxi grid with the four landmark cells.)

A minimal state encoding is sketched below.
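As a concrete, purely illustrative encoding of this state and action space, a small Python sketch might look like the following; the class and field names are assumptions, not taken from the talk.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical encoding of the Taxi domain described above.
LANDMARKS = ("Red", "Green", "Blue", "Yellow")   # the 4 landmark locations
ACTIONS = ("North", "South", "East", "West", "PickUp", "PutDown")

@dataclass(frozen=True)
class TaxiState:
    x: int                     # taxi x coordinate on the grid
    y: int                     # taxi y coordinate on the grid
    passenger: Optional[str]   # one of LANDMARKS, or None once in the taxi
    destination: str           # one of LANDMARKS

# Example: taxi at (2, 3), passenger waiting at Red, destination Green.
s0 = TaxiState(x=2, y=3, passenger="Red", destination="Green")
```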


The Taxi Hierarchy

Optimal policy:
1. Navigate to the passenger
2. Pick up the passenger
3. Navigate to the destination
4. Put down the passenger

(Figure: task hierarchy with ROOT over GET and PUT; GET uses NAVIGATE TO RED, the other navigation subtasks, and pickup; PUT uses navigation and putdown; navigation uses north, south, east, west.)

Composite actions are specified by:
- a set of child actions $A_i$
- a set of terminal states $T_i \subseteq S$
- goal rewards $\tilde{R}_i : T_i \to \mathbb{R}$

A composite-action data structure along these lines is sketched below.
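The three components above suggest a small data structure for composite actions. The sketch below is illustrative only; names such as CompositeAction and goal_reward are assumptions, and the terminal-state sets are left empty rather than spelled out.

```python
from dataclasses import dataclass
from typing import Callable, List, Set

@dataclass
class CompositeAction:
    name: str
    children: List[str]                     # child actions A_i (subtask or primitive names)
    terminal_states: Set                    # terminal states T_i, a subset of S
    goal_reward: Callable[[object], float]  # goal rewards R~_i : T_i -> R

def zero_reward(s) -> float:
    """Placeholder goal-reward function for this sketch."""
    return 0.0

# Sketch of the Taxi hierarchy from the figure:
# Root -> {Get, Put}; Get -> {Navigate-to-landmark, PickUp}; Put -> {Navigate-to-landmark, PutDown}.
NAVIGATE = [f"NavigateTo{c}" for c in ("Red", "Green", "Blue", "Yellow")]
GET = CompositeAction("Get", NAVIGATE + ["PickUp"], set(), zero_reward)
PUT = CompositeAction("Put", NAVIGATE + ["PutDown"], set(), zero_reward)
ROOT = CompositeAction("Root", ["Get", "Put"], set(), zero_reward)
```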




MAXQ Decomposition of the Value Function

Decompose the value function:
- $V_i(s) = \max_a Q_i(s, a)$ — total expected reward for action $i$
- $Q_i(s, a) = V_a(s) + C_i(s, a)$ — reward if $i$ executes $a$ first
- $C_i(s, a) = E_{k, s'}\!\left[\gamma^k V_i(s')\right]$ — reward $i$ expects after executing $a$

Example, for the taxi state $s$ shown in the figure:
$V_{\text{Root}}(s) = Q_{\text{Root}}(s, \text{Get})$
$= V_{\text{Get}}(s) + C_{\text{Root}}(s, \text{Get})$
$= V_{\text{South}}(s) + C_{\text{Green}}(s, \text{South}) + C_{\text{Get}}(s, \text{Green}) + C_{\text{Root}}(s, \text{Get})$

(Figure: the Root / Get / Navigate-to-Green / South branch of the hierarchy, annotated with the V, Q, and C terms above.)

A recursive evaluation of this decomposition is sketched below.
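Since $V$ and $Q$ are defined in terms of each other, the decomposition can be evaluated by a short mutual recursion that bottoms out at primitive actions. A minimal sketch, assuming reward_model and completion are lookups learned elsewhere (all names hypothetical, not the authors' code):

```python
# `hierarchy` maps each composite action to its children A_i; primitive actions
# are absent from it.  `reward_model[i][s]` is the expected one-step reward of
# primitive i in s, and `completion[i][(s, a)]` is the completion term C_i(s, a).

def V(i, s, hierarchy, reward_model, completion):
    """V_i(s) = max_a Q_i(s, a), bottoming out at primitive actions."""
    if i not in hierarchy:                 # primitive: expected one-step reward
        return reward_model[i][s]
    return max(Q(i, s, a, hierarchy, reward_model, completion)
               for a in hierarchy[i])      # maximize over the children A_i

def Q(i, s, a, hierarchy, reward_model, completion):
    """Q_i(s, a) = V_a(s) + C_i(s, a)."""
    return V(a, s, hierarchy, reward_model, completion) + completion[i][(s, a)]
```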


MAXQ Model Decomposition

1. Learn models of the primitive actions.
2. Plan using the existing models.
3. Compute the abstract model.
4. Apply induction up the hierarchy.

Abstract model of a composite action $a$ with policy $\pi_a$ and terminal states $T_a$:
$R^a(s) = R^{\pi_a(s)}(s) + \sum_{s' \in S \setminus T_a} P^{\pi_a(s)}(s, s')\, R^a(s')$
$P^a(s, x) = P^{\pi_a(s)}(s, x) + \sum_{s' \in S \setminus T_a} P^{\pi_a(s)}(s, s')\, P^a(s', x)$

(Figure: the Root / Get / Navigate-to-Green / South hierarchy annotated with the quantities computed at each level: learned $R, P$ at the primitives, then $Q$, $\pi$, and abstract $R, P$ at each composite node.)

A fixed-point computation of these two recurrences is sketched below.
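Because each recurrence expresses $R^a$ and $P^a$ in terms of their values at non-terminal successor states, they can be solved by a simple fixed-point iteration once the child models and the policy $\pi_a$ are fixed. A hedged sketch with hypothetical argument names; the talk itself mentions prioritized sweeping rather than the plain sweeps used here:

```python
def abstract_model(states, terminal, policy, R_child, P_child, n_sweeps=100):
    """Fixed-point iteration for the abstract model of one composite action:
         R^a(s)   = R^{pi_a(s)}(s)    + sum_{s' not in T_a} P^{pi_a(s)}(s, s') R^a(s')
         P^a(s,x) = P^{pi_a(s)}(s, x) + sum_{s' not in T_a} P^{pi_a(s)}(s, s') P^a(s', x)
    Arguments are hypothetical lookups: policy[s] is pi_a(s); R_child[c][s] and
    P_child[c][s] are the already-computed models of child c."""
    interior = [s for s in states if s not in terminal]
    R = {s: 0.0 for s in interior}
    P = {s: {x: 0.0 for x in terminal} for s in interior}
    for _ in range(n_sweeps):
        for s in interior:
            c = policy[s]                              # child chosen by pi_a in s
            R[s] = R_child[c][s] + sum(P_child[c][s].get(sp, 0.0) * R[sp]
                                       for sp in interior)
            for x in terminal:
                P[s][x] = P_child[c][s].get(x, 0.0) + sum(
                    P_child[c][s].get(sp, 0.0) * P[sp][x] for sp in interior)
    return R, P
```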




R-MAX Models of Primitive Actions

Maximum-likelihood estimation, given sufficient data:
$R^a(s) = \frac{\text{total reward}}{\text{number of transitions}}$
$P^a(s, s') = \frac{\text{number of transitions to } s'}{\text{number of transitions}}$

Optimistic models, given insufficient data:
$R^a(s) = V_{\max}$ and $P^a(s, s') = 0$

(Figure: taxi grid marking the states where pickup and putdown have been sampled.)

A small class implementing this switch between optimism and maximum likelihood is sketched below.
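A compact class capturing the rule above: maximum-likelihood estimates once a state has at least m samples, and the optimistic model (reward $V_{\max}$, no successors) before that. The class name and v_max handling are assumptions; the m = 5 default follows the experimental setup later in the talk.

```python
from collections import defaultdict

class RmaxPrimitiveModel:
    """Empirical model of one primitive action with R-MAX optimism.

    Until a state has been tried m times the model is optimistic:
    R^a(s) = V_max and P^a(s, s') = 0 for all s' (as on the slide).
    Afterwards it switches to maximum-likelihood estimates."""

    def __init__(self, m=5, v_max=0.0):
        self.m = m                                   # sample-size threshold
        self.v_max = v_max                           # optimistic reward bound
        self.count = defaultdict(int)                # n(s)
        self.reward_sum = defaultdict(float)         # total reward observed in s
        self.next_count = defaultdict(lambda: defaultdict(int))  # n(s, s')

    def update(self, s, r, s_next):
        self.count[s] += 1
        self.reward_sum[s] += r
        self.next_count[s][s_next] += 1

    def R(self, s):
        if self.count[s] < self.m:
            return self.v_max                        # optimistic reward
        return self.reward_sum[s] / self.count[s]

    def P(self, s, s_next):
        if self.count[s] < self.m:
            return 0.0                               # optimistic: no successors
        return self.next_count[s][s_next] / self.count[s]
```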


The R-MAX Algorithm

Procedure for each time step:
1. Update the model.
2. Compute the value function.
3. Choose the greedy action.

- Thorough exploration due to initial optimism
- Very large negative rewards in the exploratory episodes
- High-quality policy after the initial exploration

(Figures: reward per episode and cumulative reward over 800 episodes, comparing R-MAX and MAXQ-Q.)

A per-episode sketch of this loop follows.
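A per-episode sketch of the loop above, under the assumption of a tabular environment with a step() interface and generic plan/greedy_action helpers; all names are hypothetical, not the authors' code.

```python
def rmax_episode(env, models, plan, greedy_action, max_steps=200):
    """Run one episode with the per-step R-MAX procedure from the slide:
      1. fold the latest transition into the model,
      2. recompute the value function by planning on the model,
      3. choose the greedy action."""
    s = env.reset()
    for _ in range(max_steps):
        V = plan(models)                    # step 2: e.g. value iteration on the model
        a = greedy_action(models, V, s)     # step 3: greedy w.r.t. the optimistic model
        s_next, r, done = env.step(a)
        models[a].update(s, r, s_next)      # step 1, applied before the next decision
        s = s_next
        if done:
            break
```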


The R-MAXQ Algorithm

Procedure for each time step:
1. Update the R-MAX primitive models.
2. Compute the MAXQ composite models.
3. Resume executing the hierarchical policy.

- Propagates optimism up the hierarchy
- Memoizes models across time steps
- Employs prioritized sweeping

(Figure: the task hierarchy with R-MAX models at the primitive actions and MAXQ models composed at Get, Put, and Root.)

A per-step sketch combining the pieces above follows.
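Combining the pieces above, one R-MAXQ time step might be sketched as follows; compose_models and hierarchical_policy stand in for the MAXQ planning and execution machinery and are assumptions, not the authors' implementation.

```python
def rmaxq_step(env, s, root, primitive_models, compose_models, hierarchical_policy):
    """One R-MAXQ time step, following the slide:
      1. update the R-MAX models of the primitive actions,
      2. recompute the MAXQ composite models bottom-up (so optimism at
         unexplored state-actions propagates up the hierarchy),
      3. resume executing the hierarchical policy from the root task."""
    composite_models = compose_models(root, primitive_models)   # step 2
    a = hierarchical_policy(root, composite_models, s)          # step 3: primitive chosen by recursing down the hierarchy
    s_next, r, done = env.step(a)
    primitive_models[a].update(s, r, s_next)                    # step 1, for the next decision
    return s_next, done
```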




Experimental Setup

Environment: stochastic Taxi.

MAXQ-Q:
- Replication of Dietterich's original algorithm
- Boltzmann exploration
- Parameters from Dietterich's implementation

R-MAX primitive models:
- Each state-action optimistic until sample size m = 5
- Planning with value iteration until ε = 0.001

State abstraction:
- MAXQ-Q: all of Dietterich's abstractions
- R-MAX: max-node irrelevance for each primitive model (example: South ignores Passenger and Destination)
- R-MAXQ: also max-node irrelevance for the abstract models (example: Get ignores Destination)

A generic value-iteration sketch with this stopping tolerance follows.
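For reference, "planning with value iteration until ε = 0.001" corresponds to iterating Bellman backups on the learned tabular model until the largest change falls below the tolerance. A generic sketch; the discount factor is an assumed placeholder, not a value from the slides.

```python
def value_iteration(states, actions, R, P, gamma=0.95, eps=0.001):
    """Tabular value iteration on a learned model, stopped when the largest
    backup change drops below eps (the talk uses eps = 0.001)."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(R[a][s] + gamma * sum(P[a][s].get(sp, 0.0) * V[sp]
                                              for sp in states)
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < eps:
            return V
```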


Empirical Results

(Figures: reward per episode and cumulative reward over 800 episodes for R-MAXQ, R-MAX, and MAXQ-Q.)

- The R-MAXQ learning curve dominates the MAXQ-Q curve.
- R-MAXQ converges to the same asymptote as R-MAX.
- R-MAXQ avoids most of the costly exploration of R-MAX.


Eager Exploration Versus Lazy Exploration

- R-MAX experiments with pickup and putdown at all 50 states reachable from the initial state.
- R-MAXQ attempts pickup (putdown) at only 5 (4) reachable states in Get (Put).
- R-MAXQ never attempts putdown outside the four landmark locations.

(Figure: taxi grids contrasting where R-MAX and R-MAXQ attempt pickup and putdown, with the GET and PUT subtask regions marked.)


The Role of Hierarchy

Improve computational complexity (already known):
- Decompose tasks into smaller subtasks
- Fewer primitive actions per subtask
- Explicit state abstraction at lower levels
- Smaller "completion sets" of reachable states at higher levels (related to result distribution irrelevance)

(Figure: taxi grid showing the states reachable within ROOT, GET, and PUT, with the pickup states marked.)


Summary

- R-MAXQ combines R-MAX's robust exploration with MAXQ's incorporation of hierarchical domain knowledge.
- With regard to sample complexity, a primary role of hierarchy may be to constrain unnecessary exploration.

Future work:
- Application to larger, even continuous, domains
- Guidelines for the design or discovery of hierarchies


Appendix: The Abstract Taxi Hierarchy

(Figure: the task hierarchy — Root over Get and Put, the Navigate-to-Red subtask, and the primitives Pick Up, Put Down, North, South, East, West — annotated with the state variables relevant at each node: Root keeps x-coordinate, y-coordinate, Passenger, and Destination; Get drops Destination; the Navigate subtasks keep only x-coordinate and y-coordinate.)

This per-node mapping is sketched below.
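The per-node abstraction in the figure can be written as a mapping from each task to the state variables it keeps, which is how max-node irrelevance is applied. A hypothetical sketch, consistent with the TaxiState fields used earlier and limited to the nodes whose relevant variables are stated elsewhere in the talk:

```python
# Hypothetical encoding of max-node irrelevance: each task keeps only the
# state variables relevant to it; remaining nodes would be filled in analogously.
RELEVANT_VARIABLES = {
    "Root":          ("x", "y", "passenger", "destination"),
    "Get":           ("x", "y", "passenger"),   # Get ignores Destination
    "NavigateToRed": ("x", "y"),                # navigation ignores both
}

def abstract(state, task):
    """Project a full Taxi state onto the variables relevant to `task`."""
    return tuple(getattr(state, v) for v in RELEVANT_VARIABLES[task])
```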


Appendix: Empirical Results Without State Abstraction

(Figures: reward per episode and cumulative reward over 1600 episodes for R-MAXQ, R-MAX, and MAXQ-Q, all without state abstraction.)

- R-MAX performs slightly worse: the navigational actions require 16 times as much data, since they no longer ignore the passenger location and destination, and Pickup requires 4 times as much data, since it no longer ignores the passenger destination.
- R-MAXQ still benefits from never executing putdown outside of the four landmark locations.
- MAXQ-Q performs poorly without state abstraction.


Appendix: The Sample Complexity of R-MAXQ I

- For the same threshold amount of experience per state-action, R-MAXQ will spend no more time exploring than R-MAX.
- However, the threshold required to ensure a given level of near-optimality may be exponentially worse in the height of the hierarchy.
- These (weak) guarantees make no assumptions about the quality of the hierarchy! (In the same way, the R-MAX guarantees make no assumptions about the policy used to transform a bound on model error into a bound on value-function error.)


Appendix: The Sample Complexity of R-MAXQ II

Theorem. If $m$ samples of each state-action guarantee that R-MAX converges to an $\epsilon$-optimal policy with probability $1 - \delta$, then
$m' = O\!\left(m \left(\frac{TL}{1-\delta}\right)^{2h}\right)$
samples of each primitive state-action suffice for R-MAXQ to converge to a recursively $\epsilon$-optimal policy with probability $1 - \delta$, where
- $L$ is $O\!\left(\frac{\log \epsilon}{1-\gamma}\right)$,
- $T$ is the maximum number of reachable terminal states for any composite action, and
- $h$ is the height of the hierarchy.
