Page Replacement in Linux - AKA Kernel Conference


Page Replacement in Linux
Wu Fengguang
October 22, 2009


Outline
1. page types and organization
2. page reclaim logic

Wu Fengguang (Intel OTC), Page Replacement in Linux, 2009 AKA kernel conference


page reclaim


memory zones


zoned page allocation

DMA      24-bit addressable
DMA32    32-bit addressable
NORMAL   direct addressable
HIGHMEM  addressable via (temp) mapping

__get_free_page(gfp_mask)

zone modifier    | zone fallback order
-----------------+---------------------------------------
__GFP_DMA        | DMA
__GFP_DMA32      | DMA32 => (DMA)
(unspecified)    | NORMAL => DMA32 => (DMA)
__GFP_HIGHMEM    | HIGHMEM => NORMAL => DMA32 => (DMA)
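
The fallback table above can be read as a lookup from zone modifier to an ordered list of zones. The following is a minimal sketch in Python, not kernel code: `ZONE_ORDER`, `HIGHEST_ZONE` and `fallback_order` are hypothetical names used only for illustration, and the parenthesized (DMA) entries are treated as ordinary fallbacks.

```python
# Illustrative model of the zone fallback table; not the kernel implementation.
ZONE_ORDER = ["HIGHMEM", "NORMAL", "DMA32", "DMA"]  # highest zone first

# Zone modifier -> highest zone the allocation may use (taken from the table).
HIGHEST_ZONE = {
    "__GFP_DMA": "DMA",
    "__GFP_DMA32": "DMA32",
    None: "NORMAL",              # no zone modifier specified
    "__GFP_HIGHMEM": "HIGHMEM",
}


def fallback_order(modifier=None):
    """Return the zones to try, starting at the highest allowed zone."""
    start = ZONE_ORDER.index(HIGHEST_ZONE[modifier])
    return ZONE_ORDER[start:]
```

For example, `fallback_order("__GFP_HIGHMEM")` walks all four zones, while `fallback_order("__GFP_DMA")` can only use the DMA zone.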


zone pages


zone pages (cont.)
- free pages
- LRU cache (in 5 lists)
    - file backed pages (active + inactive)
    - swap backed pages (active + inactive)
    - unevictable pages
- slab cache (reclaimable)
    - icache
    - dcache
- other pages


file backed pages
Managed in both the radix tree and the file LRU lists.


page referenced bits
- read() sets PageReferenced or PageActive (activate it)
- mmap read sets pte_young, to be examined at reclaim time


general cache rules
In order to stay in memory:
- unmapped pages have to be referenced twice while on the active + inactive lists
- mapped pages have to be referenced once while on the inactive list


activate rules
- activate: on the 2nd read()/write() access
- activate: accessed mapped pages; some unfreeable pages
- re-activate: accessed program text
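
The "activate on the 2nd access" rule for read()/write() can be sketched as a two-touch promotion. This is a simplified Python model, loosely following the idea behind the kernel's mark_page_accessed(); the `Page` class and `mark_accessed` function are hypothetical names for illustration, and list movement, locking and mapped pages are left out.

```python
from dataclasses import dataclass


@dataclass
class Page:
    referenced: bool = False   # models PageReferenced
    active: bool = False       # models PageActive


def mark_accessed(page):
    """First access marks the page referenced; the second one activates it."""
    if not page.active and page.referenced:
        page.active = True         # promote to the active list
        page.referenced = False    # the reference has been consumed
    elif not page.referenced:
        page.referenced = True
```

A page read once stays on the inactive list with its referenced bit set; only a second access promotes it, which keeps use-once data out of the active list.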


evict rules
- de-activate: all unmapped pages; mapped data pages
- pageout: all unmapped pages; not accessed mapped pages
- recycle: pages put to writeback; some unreclaimable pages for now


balanced aging: why
On sequential read, need a small (24kb) I/O to fill the hole.


balanced aging: how

    for ratio in 1/4096, 1/2048, 1/1024, ..., 1/2, 1
        for each zone
            scan (zone_nr_lru_pages * ratio) pages;
        if enough pages reclaimed
            break;
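
The loop above can be sketched in runnable form to show why it ages zones at the same rate. This is an illustrative Python model, not the kernel code: `balanced_scan`, its parameters and the `reclaim_fn` callback are hypothetical, the ratio 1/4096 ... 1 is expressed as a right shift by a priority from 12 down to 0, and the check for enough reclaimed pages is assumed to run once per pass over all zones.

```python
def balanced_scan(zones, nr_to_reclaim, reclaim_fn, max_priority=12):
    """zones: dict of zone name -> number of LRU pages.
    reclaim_fn(zone, nr_scan) returns how many of the scanned pages were freed.
    """
    reclaimed = 0
    scanned = {zone: 0 for zone in zones}
    for priority in range(max_priority, -1, -1):   # ratio = 1/4096, ..., 1/2, 1
        for zone, nr_lru in zones.items():
            nr_scan = nr_lru >> priority           # zone_nr_lru_pages * ratio
            scanned[zone] += nr_scan
            reclaimed += reclaim_fn(zone, nr_scan)
        if reclaimed >= nr_to_reclaim:
            break
    return reclaimed, scanned


# Example: a 16MB DMA zone and a 512MB NORMAL zone with 4KB pages,
# assuming (for illustration only) that half of the scanned pages are freed.
zones = {"DMA": 4096, "NORMAL": 131072}
reclaimed, scanned = balanced_scan(zones, 32, lambda zone, n: n // 2)
```

In this example both zones end up with the same fraction of their pages scanned, which is the "balanced" property the slide refers to.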


balanced aging: overscan of high zones
problematic scenario:
the 16MB DMA zone is cycled once per second
=> the 500MB NORMAL zone is cycled once per second
=> all pages not referenced within 1 second are evicted
=> huge nr_free_pages


balanced aging: how 2

    for ratio in 1/4096, 1/2048, 1/1024, ..., 1/2, 1
        for each zone from target zone to ZONE_DMA
            scan (zone_nr_lru_pages * ratio) pages;
        if enough pages reclaimed
            break;


balanced aging: overscan of lower zones


balanced aging: low mem reserve


balanced aging: over reclaim

    for ratio in 1/4096, 1/2048, 1/1024, ..., 1/2, 1
        for each zone from target zone to lowest zone
            scan (zone_nr_lru_pages * ratio) pages
        if enough pages reclaimed
            break;


balanced aging: early break
So we reach the current 2.6.32 logic. Problems of the loops:
- risks overscanning lower zones
- risks overscanning small zones

    for ratio in 1/4096, 1/2048, 1/1024, ..., 1/2, 1
        for each zone from target zone to lowest zone
            scan (zone_nr_lru_pages * ratio) pages,
                until enough pages reclaimed && ...
        if enough pages reclaimed
            break;


balanced aging: NUMA nodes
Primary goals:
- avoid off node allocation, or
- spread NUMA interlink traffic evenly
no explicit efforts for balanced scan of NUMA nodes
aging rates are heavily impacted by page allocation policy
NUMA memory policies: interleave, bind, preferred, default


zone reclaim in a NUMA system
memory is exhausted locally => off node allocation of anonymous pages
source: Local and Remote Memory: Memory in a Linux/NUMA System, Christoph Lameter
zone reclaim: start reclaiming (unmapped) pages if a zone's free memory is low


direct reclaim in a busy system
kswapd reclaim: kswapd frees memory in the background, so that light page allocations can complete instantly;
direct reclaim: when kswapd cannot keep up with the allocation rate, applications try to free memory for themselves.
- helps throttle heavy allocation apps
- NUMA: focus reclaim on allocation heavy nodes
- may isolate too many LRU pages on fork bombs


reclaim scalability: unevictable list
to avoid scanning unreclaimable pages:
- pages pinned by mlock() or SHM_LOCK
- memory backed pages (ramfs, ramdisk)


reclaim scalability: mapped pages
we used to avoid scanning and evicting mapped pages
- mapped pages are more costly to scan/reclaim/refault
- mapped pages are more valuable to cache on a typical desktop
now tend to de-activate mapped pages unconditionally
- long round trip time (e.g. 1TB mem)
- every page has pte_young
VM_EXEC mapped file pages are still protected as usual
(for better desktop responsiveness)


reclaim efficiency: lumpy reclaim
rationale: excessive reclaims for high-order page allocation
solution: lumpy reclaim for high-order pages:
1. the first page is taken from the LRU list (as before),
2. try to reclaim the (physically) surrounding pages
   (to form a high-order block)
problem: distorts the LRU; randomly punches holes in cached files
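
Step 2 of lumpy reclaim needs the set of physically surrounding pages. A minimal sketch, assuming the surrounding pages are the naturally aligned block of 2^order page frames that contains the page taken from the LRU; `lumpy_block` is a hypothetical helper name, and the checks a real implementation needs (same zone, page state, and so on) are omitted.

```python
def lumpy_block(pfn, order):
    """Page frame numbers of the aligned 2**order block containing pfn."""
    nr_pages = 1 << order
    start = pfn & ~(nr_pages - 1)   # round down to the block boundary
    return range(start, start + nr_pages)
```

For an order-2 request triggered by page frame 1027, the candidates are frames 1024 to 1027, regardless of where those pages sit in the LRU order. This is why the technique distorts the LRU.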


split file/anon LRU lists
file LRU: file backed pages
anon LRU: mem/swap backed pages
- anonymous pages
- tmpfs pages
scan ratio ∝ recent_scanned : recent_referenced
(more referenced, less scanned)
- protects anonymous pages
- improves reclaim efficiency
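
The proportionality above can be illustrated with a small calculation. This is a simplified sketch under stated assumptions, not the kernel's formula: `scan_percent` is a hypothetical name, the +1 terms are added here only to avoid division by zero, and tunables such as swappiness are ignored.

```python
def scan_percent(anon_scanned, anon_referenced, file_scanned, file_referenced):
    """Split scanning effort between the anon and file LRUs.

    Pressure on a list grows with recent_scanned and shrinks with
    recent_referenced: more referenced, less scanned.
    """
    anon_pressure = (anon_scanned + 1) / (anon_referenced + 1)
    file_pressure = (file_scanned + 1) / (file_referenced + 1)
    total = anon_pressure + file_pressure
    return 100 * anon_pressure / total, 100 * file_pressure / total


# Anonymous pages are mostly re-referenced, file pages mostly are not.
anon_pct, file_pct = scan_percent(999, 899, 999, 99)
```

With these made-up counters the frequently referenced anon list receives roughly a tenth of the scanning, which is how the split protects anonymous pages.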


anon LRU active:inactive ratio
- balanced scan in the normal case
- shrink active list iff inactive list is low
- inactive list serves as the only grace window for active references
- note that when inactive list => 0, LRU reduces to FIFO
active:inactive => sqrt(10 * GB)
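
The sqrt(10 * GB) target can be tabulated for a few memory sizes. A minimal sketch, assuming GB is the memory size in whole gigabytes, the square root is an integer square root, and sizes below 1GB fall back to 1:1; `inactive_ratio` is a hypothetical name for this illustration.

```python
import math


def inactive_ratio(mem_bytes):
    """Target active:inactive ratio for the anon LRU, as N in N:1."""
    gb = mem_bytes >> 30                 # size in whole GB
    return math.isqrt(10 * gb) if gb else 1
```

Under these assumptions 1GB gives 3:1, 10GB gives 10:1, 100GB gives 31:1 and 1TB gives 101:1, so the inactive list grows in absolute size but shrinks as a fraction of memory.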


file LRU active:inactive ratio
- balanced scan in the normal case
- shrink active list only when inactive list is low
active list: working set
inactive list: use-once data
protect working set from being flushed by use-once data
active:inactive => 1:1


advanced page replacement
current logic: balanced but not optimal
obvious optimizations:
- streaming IO: drop behind
- use frequency instead of recency
available solutions: CLOCK-Pro, CART, CAR, ARC, LIRS etc.
problems:
- complexity
- possible regressions


readings and resources
LinuxMM: http://linux-mm.org/
linux-mm mailing list: http://marc.info/?l=linux-mm
linux kernel source code: /usr/src/linux $ git log mm
book: Understanding The Linux Virtual Memory Manager
      http://www.skynet.ie/~mel/projects/vm/guide/pdf/understand.pdf


Thank you!
