
Performance Analysis and Tuning – Part 1

Larry Woodman, Senior Consulting Engineer, RHEL/VM
Bill Gray, Principal Performance Engineer, Red Hat

June 13, 2013


Agenda: Performance Analysis and Tuning, Parts I and II

Part I
● Red Hat Enterprise Linux "tuned" profiles, top benchmark results
● Scalability – CFS scheduler tunables / Cgroups
● Hugepages – Transparent Hugepages, 2MB/1GB
● Non-Uniform Memory Access (NUMA) and NUMAD

Part II
● Network Performance and latency-performance
● Disk and Filesystem IO – throughput-performance
● System Performance/Tools – perf, tuna, systemtap
● Q & A


Red Hat Enterprise Linux: Scale Up & Out

Traditional scale-out capabilities have been complemented over the past five years with scale-up capabilities, bringing open source value and flexibility to the x86_64 server market.

● Up to 4096 CPUs
● Support for scalable architectures
● Multi-core and hyperthreading
● Kernel NUMA and SMP enhancements

[Diagram: scale up from 1 CPU to 4096 CPUs; scale out from 1 node to 1000s of nodes.]


Red Hat Enterprise Linux 6: Benchmark platform of choice
Survey of Benchmark Results, 6/2013

● Standard Performance Evaluation Corporation (SPEC) – www.spec.org
  ● SPECcpu2006
  ● SPECvirt_sc2010, sc2013
  ● SPECjbb2013
● Transaction Processing Performance Council (TPC) – www.tpc.org
  ● TPC-H: 3 of top 6 categories
  ● TPC-C: top virtualization result w/ IBM
● STAC FSI workloads – www.stacresearch.com
● SAP Sales and Distribution – www.sap.com/campaigns/benchmark


Red Hat Enterprise Linux 6: Benchmark platform of choice
SPEC Benchmark Publications 2011 – 2013

[Chart: percentage of SPEC publications using RHEL 6 (as of June 1, 2013), by year 2011–2013, for cpu2006, virt_sc2010, jEnterprise2010, virt_sc2013 and jbb2013.]

SPEC® is a registered trademark of the Standard Performance Evaluation Corporation. For more information about SPEC and its benchmarks see www.spec.org


Red Hat Enterprise Linux 6: Benchmark platform of choice
TPC Benchmark Publications 2011 – 2013

[Chart: percentage of TPC-C and TPC-H publications using RHEL 6 (as of June 1, 2013), by year 2011–2013.]

For more information about the TPC and its benchmarks see www.tpc.org.


Red Hat Enterprise Linux 6.4 vs Windows Server 2012 – LINPACK

[Chart: LINPACK comparison.] Source: Principled Technologies, Inc. & Red Hat, Inc., 05/28/13


tuned Profile Comparison Matrix

Profiles compared: default, enterprise-storage, virtual-host, virtual-guest, latency-performance, throughput-performance

● kernel.sched_min_granularity_ns: default 4ms; 10ms in the tuned profiles
● kernel.sched_wakeup_granularity_ns: default 4ms; 15ms in the tuned profiles
● vm.dirty_ratio: default 20% of RAM; 10–40% in the tuned profiles
● vm.dirty_background_ratio: default 10% of RAM; 5% where changed
● vm.swappiness: default 60; 10 or 30 where changed
● I/O scheduler (elevator): default CFQ; deadline in all tuned profiles
● Filesystem barriers: default on; off in three of the profiles
● CPU governor: default ondemand; performance in the tuned profiles
● Disk read-ahead: increased 4x
● Disable THP: yes
● Disable C-states: yes

https://access.redhat.com/site/solutions/369093
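As a quick illustration of switching between these profiles, the commands below come from the RHEL 6 tuned package (output omitted); a minimal sketch:

# tuned-adm list                            # show the available profiles
# tuned-adm profile throughput-performance  # apply a profile
# tuned-adm active                          # confirm which profile is active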


Red Hat Enterprise Linux 6 Scheduler Tunables

Implements multilevel run queues for sockets and cores (as opposed to one run queue per processor or per system).

RHEL6 tunables:
● sched_min_granularity_ns
● sched_wakeup_granularity_ns
● sched_migration_cost
● sched_child_runs_first
● sched_latency_ns

[Diagram: per-socket/per-core scheduler compute queues, with processes queued per core/thread rather than in one system-wide run queue.]


Finer grained scheduler tuning

● /proc/sys/kernel/sched_*
● The Red Hat Enterprise Linux 6 tuned-adm profiles increase the scheduling quantum to be on par with Red Hat Enterprise Linux 5:

● echo 10000000 > /proc/sys/kernel/sched_min_granularity_ns
  ● Minimal preemption granularity for CPU-bound tasks. See sched_latency_ns for details. The default value is 4000000 (ns).
● echo 15000000 > /proc/sys/kernel/sched_wakeup_granularity_ns
  ● The wake-up preemption granularity. Increasing this variable reduces wake-up preemption, reducing disturbance of compute-bound tasks. Lowering it improves wake-up latency and throughput for latency-critical tasks, particularly when a short duty cycle load component must compete with CPU-bound components. The default value is 5000000 (ns).
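The same values can also be applied with sysctl and made persistent across reboots; a minimal sketch using the sysctl names shown above and the standard /etc/sysctl.conf mechanism:

# sysctl -w kernel.sched_min_granularity_ns=10000000
# sysctl -w kernel.sched_wakeup_granularity_ns=15000000
# echo 'kernel.sched_min_granularity_ns = 10000000' >> /etc/sysctl.conf
# echo 'kernel.sched_wakeup_granularity_ns = 15000000' >> /etc/sysctl.conf
# sysctl -p     # reload settings from /etc/sysctl.conf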


Load Balancing

● Scheduler tries to keep all CPUs busy by moving tasks from overloaded CPUs to idle CPUs
● Detect using "perf stat"; look for excessive "migrations"
● /proc/sys/kernel/sched_migration_cost
  ● Amount of time after the last execution that a task is considered to be "cache hot" in migration decisions. A "hot" task is less likely to be migrated, so increasing this variable reduces task migrations. The default value is 500000 (ns).
  ● If the CPU idle time is higher than expected when there are runnable processes, try reducing this value. If tasks bounce between CPUs or nodes too often, try increasing it.
● Rule of thumb – increase by 2-10x to reduce load balancing
● Increase by 10x on large systems when many cgroups are actively used (e.g. RHEV/KVM/RHOS)
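One way to check for excessive migrations, sketched with the standard perf software counters (the 10 second window and the <pid> placeholder are just examples):

# system-wide for 10 seconds
# perf stat -e cpu-migrations,context-switches -a sleep 10

# or for a single process
# perf stat -e cpu-migrations -p <pid> sleep 10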


sched_migration_cost

RHEL6.3: effect of sched_migration_cost on fork/exit microbenchmarks
Intel Westmere EP, 24 CPUs / 12 cores, 24 GB memory

[Chart: usec/call with the default 500us value vs. tuned 4ms, and percent improvement, for exit_10, exit_100, exit_1000, fork_10, fork_100 and fork_1000 tests.]


sched_child_runs_first

● Controls fork() behavior: whether the parent or the child runs first
● Default is 0: the parent continues before its children run
● Default is different than RHEL5


Red Hat Enterprise Linux 6 Hugepages / VM Tuning

● Standard HugePages, 2MB
  ● Reserve/free via
    ● /proc/sys/vm/nr_hugepages
    ● /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages
  ● Used via hugetlbfs
● GB Hugepages, 1GB
  ● Reserved at boot time / no freeing
  ● Used via hugetlbfs
● Transparent HugePages, 2MB
  ● On by default via boot args or /sys
  ● Used for anonymous memory

[Diagram: TLB (128 data / 128 instruction entries) mapping the virtual address space to physical memory.]
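A short sketch of reserving 2MB pages on one specific NUMA node through the per-node sysfs path listed above (the count of 512 pages and the choice of node 0 are arbitrary examples):

# echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
# grep HugePages /sys/devices/system/node/node0/meminfo   # confirm the node-local pool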


2MB standard Hugepages

# echo 2000 > /proc/sys/vm/nr_hugepages
# cat /proc/meminfo
MemTotal:       16331124 kB
MemFree:        11788608 kB
HugePages_Total:    2000
HugePages_Free:     2000
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

# ./hugeshm 1000
# cat /proc/meminfo
MemTotal:       16331124 kB
MemFree:        11788608 kB
HugePages_Total:    2000
HugePages_Free:     1000
HugePages_Rsvd:     1000
HugePages_Surp:        0
Hugepagesize:       2048 kB


1GB Hugepages

Boot arguments:
● default_hugepagesz=1G hugepagesz=1G hugepages=8

# cat /proc/meminfo | more
HugePages_Total:       8
HugePages_Free:        8
HugePages_Rsvd:        0
HugePages_Surp:        0

# mount -t hugetlbfs none /mnt
# ./mmapwrite /mnt/junk 33
writing 2097152 pages of random junk to file /mnt/junk
wrote 8589934592 bytes to file /mnt/junk

# cat /proc/meminfo | more
HugePages_Total:       8
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
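When both 2MB and 1GB pools are configured, hugetlbfs can be told which page size to serve via its pagesize mount option; a minimal sketch (the mount point name is an arbitrary example):

# mkdir -p /mnt/huge1g
# mount -t hugetlbfs -o pagesize=1G none /mnt/huge1g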


Transparent Hugepages

● Boot argument: transparent_hugepage=always (enabled by default)
● Runtime control: /sys/kernel/mm/redhat_transparent_hugepage/enabled

# echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
# time ./memory 15
real    0m12.434s
user    0m0.936s
sys     0m11.416s
# cat /proc/meminfo
MemTotal:       16331124 kB
AnonHugePages:         0 kB

# echo always > /sys/kernel/mm/redhat_transparent_hugepage/enabled
# time ./memory 15GB
real    0m7.024s
user    0m0.073s
sys     0m6.847s
# cat /proc/meminfo
MemTotal:       16331124 kB
AnonHugePages:  15590528 kB

SPEEDUP 12.4/7.0 = 1.77x (runtime reduced to 56%)
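A quick sketch for checking whether THP is actually in use on a RHEL 6 system, using the RHEL 6 sysfs path shown above plus the kernel's standard counters:

# cat /sys/kernel/mm/redhat_transparent_hugepage/enabled   # bracketed value is the active mode
# grep AnonHugePages /proc/meminfo                         # anonymous memory backed by huge pages
# egrep 'thp_' /proc/vmstat                                 # thp_fault_alloc, thp_collapse_alloc, ...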


Performance – RHEL 6 / Sandy Bridge: SPECjbb Java w/ 1GB huge pages

● Sandy Bridge has 1GB hugepages – support in RHEL 5.8 and 6.2
● Before RHEL 6, only static use of hugepages
  ● Static pages wired down
  ● Need application support (DB/Java etc.)
● RHEL 6 Transparent Huge Pages
  ● Use 2MB x86_64 pages vs 4k pages
  ● Automatically used for all anonymous memory
  ● Daemon gathers free pages dynamically

[Chart: RHEL6.4 SPECjbb (sun_hotspot) bops with 2MB/1GB hugepages, Intel Sandy Bridge 16 core / 32GB – %gain over RHEL6.2 with THP disabled: 2MB HugePages +9.1%, 1GB HugePages +12.6%.]


Memory Zones

32-bit:
● DMA Zone: 0 – 16MB
● Normal Zone: 16MB – 896MB (or 3968MB)
● Highmem Zone: up to 64 GB (PAE)

64-bit:
● DMA Zone: 0 – 16MB
● DMA32 Zone: 16MB – 4GB
● Normal Zone: 4GB – end of RAM


Split LRU pagelists

● Separate page lists for anonymous and pagecache pages
● Prevents mixing of anonymous and file-backed pages on the active and inactive LRU lists
● Eliminates long pauses when all CPUs enter direct reclaim during memory exhaustion
● Prevents swapping when copying very large files
● Prevents swapping of database cache during backup.
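The split lists can be observed directly in /proc/meminfo; a minimal sketch of how to watch the anonymous vs. file-backed LRU sizes change under load:

# grep -E '^(Active|Inactive)' /proc/meminfo   # Active(anon)/(file), Inactive(anon)/(file)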


Per Node/Zone split LRU Paging Dynamics

[Diagram: user allocations land on the active anon/file LRU lists; page aging moves pages to the inactive lists; reclaim swaps out anonymous pages and flushes/frees file pages back to the free lists; referenced pages are reactivated onto the active lists; user deletions free pages directly.]


What is NUMA?

● Non Uniform Memory Access
● A result of making bigger systems more scalable by distributing system memory near individual CPUs....
● All multi-socket x86_64 server systems are NUMA
  ● Most servers have 1 NUMA node / socket
  ● Recent AMD systems have 2 NUMA nodes / socket
● Keep interleave memory in BIOS off (default)
  ● Else the OS will see only 1 NUMA node!!!


Typical System Building Block

[Diagram: one NUMA node – four cores (Core 0–3) sharing an L3 cache, attached to a memory controller with node-local RAM, plus QPI links, IO, etc.]


Two NUMA node system

[Diagram: Node 0 and Node 1, each with four cores sharing an L3 cache and node-local RAM, connected by QPI links, IO, etc.]


Four NUMA node system, fully-connected topology

[Diagram: Nodes 0–3, each with four cores, an L3 cache and node-local RAM; QPI links connect every node directly to every other node.]


Four NUMA node system, ring topology

[Diagram: Nodes 0–3, each with four cores, an L3 cache and node-local RAM; QPI links connect the nodes in a ring rather than all-to-all.]


Per NUMA-Node Resources

● Memory zones (DMA & Normal zones)
● CPUs
● IO/DMA capacity
● Interrupt processing
● Page reclamation kernel thread (kswapd#)
● Lots of other kernel threads


NUMA Nodes and Zones (64-bit)

[Diagram: Node 0 holds the DMA zone (0–16MB), the DMA32 zone (16MB–4GB) and a Normal zone; Node 1 holds only a Normal zone extending to the end of RAM.]


zone_reclaim_mode

● Controls NUMA specific memory allocation policy
● When set and node memory is exhausted:
  ● Reclaim memory from the local node rather than allocating from the next node
  ● Slower allocation, higher NUMA hit ratio
● When clear and node memory is exhausted:
  ● Allocate from all nodes before reclaiming memory
  ● Faster allocation, higher NUMA miss ratio
● Default is set at boot time based on NUMA factor
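A minimal sketch for inspecting and changing the setting at runtime (vm.zone_reclaim_mode is the standard sysctl name for this file):

# cat /proc/sys/vm/zone_reclaim_mode        # or: sysctl vm.zone_reclaim_mode
# echo 0 > /proc/sys/vm/zone_reclaim_mode   # clear: allocate from other nodes before reclaiming
# echo 1 > /proc/sys/vm/zone_reclaim_mode   # set: reclaim from the local node first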


Learn about CPUs via lscpu

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    1
Core(s) per socket:    10
CPU socket(s):         4
NUMA node(s):          4
. . . .
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0,4,8,12,16,20,24,28,32,36
NUMA node1 CPU(s):     2,6,10,14,18,22,26,30,34,38
NUMA node2 CPU(s):     1,5,9,13,17,21,25,29,33,37
NUMA node3 CPU(s):     3,7,11,15,19,23,27,31,35,39


Visualize CPUs via lstopo (from the hwloc package)

# lstopo

[Diagram: lstopo graphical output showing the machine's NUMA nodes, caches, cores and threads.]


Learn NUMA layout via numactl

# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 4 8 12 16 20 24 28 32 36
node 0 size: 65415 MB
node 0 free: 63482 MB
node 1 cpus: 2 6 10 14 18 22 26 30 34 38
node 1 size: 65536 MB
node 1 free: 63968 MB
node 2 cpus: 1 5 9 13 17 21 25 29 33 37
node 2 size: 65536 MB
node 2 free: 63897 MB
node 3 cpus: 3 7 11 15 19 23 27 31 35 39
node 3 size: 65536 MB
node 3 free: 63971 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10


Sample remote access latencies (relative to local access)

● 4 socket / 4 node: 1.5x
● 4 socket / 8 node: 2.7x
● 8 socket / 8 node: 2.8x
● 32 node system: 5.5x
  ● (30/32 inter-node latencies >= 4x)
  ● Node distance values (count of the 1024 table entries): 10 (32: 3.1%), 13 (32: 3.1%), 40 (64: 6.2%), 48 (448: 43.8%), 55 (448: 43.8%)


Red Hat Enterprise Linux 6.4: SPECjbb2005 w/ numactl

[Chart: RHEL6.4 SPECjbb (OpenJDK) bops, Intel Westmere EX, 40 core, 4 socket, 256 GB, four instances (inst1–inst4); bare metal vs. KVM, default vs. numactl binding.]


So, what's the NUMA problem?

● The Linux system scheduler is very good at maintaining responsiveness and optimizing for CPU utilization
● It tries to use idle CPUs, regardless of where process memory is located.... Using remote memory degrades performance!
  ● Red Hat is working with the upstream community to increase NUMA awareness of the scheduler and to implement automatic NUMA balancing.
● Remote memory latency matters most for long-running, significant processes, e.g., HPTC, VMs, etc.


Use numastat to see memory layout

● Rewritten for Red Hat Enterprise Linux 6.4 to show per-node system and process memory information
● 100% compatible with prior version by default, displaying /sys...node/numastat memory allocation statistics
● Any command options invoke new functionality
  ● -m for per-node system memory info
  ● -p <PID|pattern> for per-node process memory info
● See numastat(8)


numastat: compatibility mode

# numastat
                  node0     node1     node2     node3
numa_hit        1655286    266159    314693    273846
numa_miss             0      2790         0         0
numa_foreign       2790         0         0         0
interleave_hit    14365     14354     14366     14348
local_node      1652364    249938    298463    257638
other_node         2922     19011     16230     16208

                  node4     node5     node6     node7
numa_hit         252059    529980    240696    375607
numa_miss             0         0         0         0
numa_foreign          0         0         0         0
interleave_hit    14367     14336     14333     14388
local_node       235903    513789    224511    361928
other_node        16156     16191     16185     13679


numastat: compressed display

# numastat -c

Per-node numastat info (in MBs):
                Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7  Total
                ------ ------ ------ ------ ------ ------ ------ ------ ------
Numa_Hit          6479   1040   1230   1070    985   2070    941   1468  15284
Numa_Miss            0     11      0      0      0      0      0      0     11
Numa_Foreign        11      0      0      0      0      0      0      0     11
Interleave_Hit      56     56     56     56     56     56     56     56    449
Local_Node        6468    977   1166   1007    922   2007    877   1415  14839
Other_Node          11     74     63     63     63     63     63     53    455


numastat: per-node meminfo

# numastat -mczs

Per-node system memory usage (in MBs):
                Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7  Total
                ------ ------ ------ ------ ------ ------ ------ ------ ------
MemTotal         32766  32768  32768  32768  32768  32768  32768  32752 262126
MemFree          31863  31965  32120  32086  32098  32080  32114  32062 256388
MemUsed            903    803    648    682    670    688    654    690   5738
FilePages           11     26      8     37     21     18      9     45    176
Slab                25     16      7     10     12     36     10     10    126
Active               5     13      4     25     10      9      6     41    113
Active(file)         4     11      3     23      8      6      3     40     99
SUnreclaim          19     10      6      6      9     33      7      7     97
Inactive             7     15      4     14     12     12      6      6     76
Inactive(file)       7     15      4     14     12     12      6      6     76
SReclaimable         7      6      2      4      3      3      3      2     29
Active(anon)         2      1      1      2      2      2      3      2     14
AnonPages            2      1      1      2      2      2      3      2     14
Mapped               0      0      0      1      4      3      1      1     11
KernelStack          9      0      0      0      0      0      0      0     10
PageTables           0      0      0      0      1      1      0      1      3
Shmem                0      0      0      0      0      0      0      0      0
Inactive(anon)       0      0      0      0      0      0      0      0      0


numastat shows unaligned guests

# numastat -c qemu

Per-node process memory usage (in MBs)
PID               Node 0 Node 1 Node 2 Node 3  Total
---------------   ------ ------ ------ ------ ------
10587 (qemu-kvm)    1216   4022   4028   1456  10722
10629 (qemu-kvm)    2108     56    473   8077  10714
10671 (qemu-kvm)    4096   3470   3036    110  10712
10713 (qemu-kvm)    4043   3498   2135   1055  10730
---------------   ------ ------ ------ ------ ------
Total              11462  11045   9672  10698  42877


numastat shows aligned guests

# numastat -c qemu

Per-node process memory usage (in MBs)
PID               Node 0 Node 1 Node 2 Node 3  Total
---------------   ------ ------ ------ ------ ------
10587 (qemu-kvm)       0  10723      5      0  10728
10629 (qemu-kvm)       0      0      5  10717  10722
10671 (qemu-kvm)       0      0  10726      0  10726
10713 (qemu-kvm)   10733      0      5      0  10738
---------------   ------ ------ ------ ------ ------
Total              10733  10723  10740  10717  42913


Some KVM NUMA Suggestions

● Don't assign extra resources to guests
  ● Don't assign more memory than can be used
  ● Don't make guests unnecessarily wide
    ● Not much point to more VCPUs than application threads
● For best NUMA affinity and performance, the number of guest VCPUs should be small enough to fit within a single NUMA node


How to manage NUMA manually

● Research the NUMA topology of each system
● Make a resource plan for each system
● Bind both CPUs and memory
  ● Might also consider devices and IRQs
● Use numactl for native jobs:
  ● numactl -N <nodes> -m <nodes> <workload>
● Use numatune for libvirt-started guests
  ● Edit the guest XML: add a <numatune> element
● Use cgroups w/ apps to bind cpu/mem to NUMA nodes
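A minimal sketch of both approaches; the node number, guest name and application path are arbitrary examples, and the <numatune> element follows the standard libvirt domain XML syntax:

# run a native job with CPUs and memory confined to node 1
# numactl -N 1 -m 1 /path/to/myapp

# for a libvirt guest, edit the domain XML (virsh edit <guest>) and add:
#   <numatune>
#     <memory mode='strict' nodeset='1'/>
#   </numatune>
# and optionally pin the VCPUs to that node's CPUs, e.g.:
#   <vcpu placement='static' cpuset='2,6,10,14'>4</vcpu>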


Resource Management using cgroups

● Ability to manage large system resources effectively
● Control Groups (cgroups) for CPU/Memory/Network/Disk
● Benefit: guarantee Quality of Service & dynamic resource allocation
● Ideal for managing any multi-application environment, from back-ups to the Cloud


numad can help improve NUMA performance

● New Red Hat Enterprise Linux 6.4 user-level daemon to automatically improve out-of-the-box NUMA system performance, and to balance NUMA usage in dynamic workload environments
  ● Was tech-preview in Red Hat Enterprise Linux 6.3
● Improves NUMA performance for some workloads
● Not enabled by default
● See numad(8)
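Since it is not enabled by default, a quick sketch of turning it on as a service, assuming the numad package and its standard RHEL 6 init script:

# yum install numad
# service numad start
# chkconfig numad on     # start automatically at boot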


numad matches resource consumers with available resources

[Diagram: a node scanner tracks available CPUs and memory per node; a process scanner tracks required CPUs and memory per process; the numad picker matches consumed against available resources and produces a node list for each process.]


numad aligns process memory and CPU threads within nodes

[Diagram: before numad, processes 19, 29, 37 and 61 are spread across nodes 0–3; after numad, each process is consolidated onto a single node.]


Numad - aligning memory <strong>and</strong> threads in nodes:<strong>Red</strong>uces memory latency, improves determinism


numad usage

● numad is intended primarily for server consolidation environments
  ● Multiple applications running on the same server
  ● Multiple instances of the same application
  ● Multiple virtual guests
● numad is most likely to have a positive effect when processes can be localized in a fractional subset of the system's NUMA nodes.
● If the entire system is dedicated to a large in-memory database application, for example -- especially if memory accesses will likely remain unpredictable -- numad will probably not improve performance.
● Similarly, very high bandwidth applications -- that really need all the system memory controllers -- will likely not benefit from localization


Start, stop numad, and set interval

● # numad -i [<min>:]<max>  to specify the scan interval in seconds
● # numad -i 0              to terminate the numad daemon
● Default is "-i 5:15"
● Increasing the max interval will decrease overhead -- but will also decrease responsiveness to changing loads.
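For example, restating the defaults above as explicit commands:

# numad -i 5:15     # run with a 5 second minimum / 15 second maximum scan interval
# numad -i 0        # stop the running numad daemon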


To change the utilization target

● -u <percent> to specify the target utilization percent
● Default is "-u 85"
● Increase the utilization target to more fully utilize the resources on each node
● Decrease the utilization target to maintain more per-node resource margin for bursty loads
  ● Could also decrease it to force processes across multiple nodes


Bare Metal - Java Workload: Automatic numad Improvement

[Chart: multi-instance Java workload (BOPs vs. warehouses) on a 4 socket, 8 node system – default vs. numad vs. numactl binding, with percent gain over default.]


To get pre-placement advice

● numad -w <ncpus>[:<MB>]  for node suggestions
● Output is a recommended node list, e.g., "1-2,4"
● Can be used regardless of whether numad is running as a daemon
  ● Will take a couple seconds if not running
● Used by libvirt for optional VM auto placement
● Could be used in a shell script for automated job placement (see the script on the following slides)
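A short sketch of asking for advice and acting on it; the resource request (6 CPUs, 8192 MB) and job name are arbitrary examples, and the memory argument is assumed to be in MB per numad(8):

# numad -w 6:8192              # ask where a job needing 6 CPUs and 8 GB should run
1-2,4                          # numad's recommended node list
# numactl -N 1-2,4 -m 1-2,4 ./myjob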


numad "-w" shell script example

#!/bin/bash
PROCESSES=$1; shift
THREADS=$1; shift
GIGABYTES=$1; shift
echo "Trying $PROCESSES fake 'guests' with $THREADS VCPUs and $GIGABYTES GB each."
echo "Note average work accomplished -- displayed in a few minutes."
for (( i=1; i<=$PROCESSES; i++ )); do


numad "-w" shell script (the important part)

for (( i=1; i<=$PROCESSES; i++ )); do
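The body of the loop is cut off on these slides. Based on the commands visible in the output on the next two slides (numad -w advice, the pig test tool, and numactl -N/-m binding), a plausible reconstruction -- purely illustrative, with the pig tool path and options copied from that output -- might look like:

for (( i=1; i<=$PROCESSES; i++ )); do
    # ask numad where a job of this size should run
    NODES=`numad -w $THREADS:$((GIGABYTES * 1024))`
    echo "numad advises to use nodes: $NODES"
    # start a fake 'guest' bound to the advised nodes
    numactl -N $NODES -m $NODES ../pig_tool/pig -t $THREADS -gm $((GIGABYTES * 1000)) -s 60 -l mem &
done
echo "Sleeping while the fake guests finish up..."
wait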


numad "-w" shell script (ignorant)

# ./pig_place_test.sh 5 6 7
Trying 5 fake 'guests' with 6 VCPUs and 7 GB each.
Note average work accomplished -- displayed in a few minutes.
numad advises to use nodes: 2 -- but ignoring that and not binding...
../pig_tool/pig -t 6 -gm 7000 -s 60 -l mem
numad advises to use nodes: 7 -- but ignoring that and not binding...
../pig_tool/pig -t 6 -gm 7000 -s 60 -l mem
numad advises to use nodes: 3 -- but ignoring that and not binding...
../pig_tool/pig -t 6 -gm 7000 -s 60 -l mem
numad advises to use nodes: 3 -- but ignoring that and not binding...
../pig_tool/pig -t 6 -gm 7000 -s 60 -l mem
numad advises to use nodes: 3 -- but ignoring that and not binding...
../pig_tool/pig -t 6 -gm 7000 -s 60 -l mem
Sleeping while the fake guests finish up...
Threads: 6 Avg: 39.5 Stddev: 7.6  Min: 33 Max: 54
Threads: 6 Avg: 35.8 Stddev: 5.5  Min: 31 Max: 44
Threads: 6 Avg: 39.7 Stddev: 4.2  Min: 35 Max: 45
Threads: 6 Avg: 49.8 Stddev: 12.5 Min: 33 Max: 62
Threads: 6 Avg: 62.3 Stddev: 13.3 Min: 49 Max: 80


numad "-w" shell script (advised)

OK, now trying same size fake 'guests' using numad placement advice.
Average work accomplished should be higher, stddev might be better too.
numad advises to use nodes: 2
numactl -N 2 -m 2 ../pig_tool/pig -t 6 -gm 7000 -s 60 -l mem
numad advises to use nodes: 1
numactl -N 1 -m 1 ../pig_tool/pig -t 6 -gm 7000 -s 60 -l mem
numad advises to use nodes: 3
numactl -N 3 -m 3 ../pig_tool/pig -t 6 -gm 7000 -s 60 -l mem
numad advises to use nodes: 7
numactl -N 7 -m 7 ../pig_tool/pig -t 6 -gm 7000 -s 60 -l mem
numad advises to use nodes: 6
numactl -N 6 -m 6 ../pig_tool/pig -t 6 -gm 7000 -s 60 -l mem
Sleeping while the fake guests finish up...
Threads: 6 Avg: 105.0 Stddev: 0.0 Min: 105 Max: 105
Threads: 6 Avg: 106.0 Stddev: 0.0 Min: 106 Max: 106
Threads: 6 Avg: 106.0 Stddev: 0.0 Min: 106 Max: 106
Threads: 6 Avg: 105.0 Stddev: 0.0 Min: 105 Max: 105
Threads: 6 Avg: 104.0 Stddev: 0.0 Min: 104 Max: 104


Multiguest - KVM Java Workload

Eight KVM guests with Java load; numad with RHEL6.4 host and 6.3 guests

[Chart: BOPS and %gain vs. warehouses (wh2, wh4, wh6, wh8) for default, numad, and numatune configurations.]


Multiguest Oracle OLTP Workload

Oracle OLTP in TPM, 4 KVM guests; Intel 32-cpu, 128 GB, 2 FC

[Chart: TPM and %gain at 20, 40 and 80 users for default, numad, and NUMA-pinned configurations.]


numad future

● Shipping in Red Hat Enterprise Linux 6.4
● Potential future improvements: device and IRQ affinity, related process hints
● Future TBD pending upstream kernel efforts
  ● Perhaps complementary NUMA management roles, as systems will continue to grow in size and complexity


Summary / Questions

● Red Hat Enterprise Linux 6 Performance Features
  ● "tuned" tool – adjusts system parameters to match environments - throughput/latency.
  ● Transparent Huge Pages – auto select large pages for anonymous memory, static hugepages for shared mem
  ● Non-Uniform Memory Access (NUMA)
    ● numastat enhancements
    ● numactl for manual control
    ● numad daemon for auto placement
  ● TUNA – integration w/ Red Hat Enterprise Linux 6.4
● (...Come back for part 2...)


cgroups Architecture


Cgroup default mount points

# cat /etc/cgconfig.conf
mount {
    cpuset  = /cgroup/cpuset;
    cpu     = /cgroup/cpu;
    cpuacct = /cgroup/cpuacct;
    memory  = /cgroup/memory;
    devices = /cgroup/devices;
    freezer = /cgroup/freezer;
    net_cls = /cgroup/net_cls;
    blkio   = /cgroup/blkio;
}

# ls -l /cgroup
drwxr-xr-x 2 root root 0 Jun 21 13:33 blkio
drwxr-xr-x 3 root root 0 Jun 21 13:33 cpu
drwxr-xr-x 3 root root 0 Jun 21 13:33 cpuacct
drwxr-xr-x 3 root root 0 Jun 21 13:33 cpuset
drwxr-xr-x 3 root root 0 Jun 21 13:33 devices
drwxr-xr-x 3 root root 0 Jun 21 13:33 freezer
drwxr-xr-x 3 root root 0 Jun 21 13:33 memory
drwxr-xr-x 2 root root 0 Jun 21 13:33 net_cls


Cgroup how-to

Create a 1GB/2CPU subset of a 16GB/8CPU system:

# numactl --hardware
# mount -t cgroup xxx /cgroups
# mkdir -p /cgroups/test
# cd /cgroups/test
# echo 1 > cpuset.mems
# echo 2-3 > cpuset.cpus
# echo 1G > memory.limit_in_bytes
# echo $$ > tasks
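The same cgroup can also be managed with the libcgroup userspace tools shipped in RHEL 6; a rough equivalent of the steps above (the group name "test" and the application path are just example names):

# cgcreate -g cpuset,memory:/test
# cgset -r cpuset.mems=1 test
# cgset -r cpuset.cpus=2-3 test
# cgset -r memory.limit_in_bytes=1073741824 test   # 1 GB
# cgexec -g cpuset,memory:test /path/to/myapp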


cgroups

[root@dhcp-100-19-50 ~]# forkmany 20MB 100procs &
[root@dhcp-100-19-50 ~]# top -d 5
top - 12:24:13 up 1:36, 4 users, load average: 22.70, 5.32, 1.79
Tasks: 315 total, 93 running, 222 sleeping, 0 stopped, 0 zombie
Cpu0 :  0.0%us,  0.2%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 :  0.0%us,  0.2%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 :100.0%us,  0.0%sy, 0.0%ni,  0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 89.6%us, 10.0%sy, 0.0%ni,  0.0%id, 0.0%wa, 0.2%hi, 0.2%si, 0.0%st
Cpu4 :  0.4%us,  0.6%sy, 0.0%ni, 98.8%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
Cpu5 :  0.4%us,  0.0%sy, 0.0%ni, 99.2%id, 0.0%wa, 0.0%hi, 0.4%si, 0.0%st
Cpu6 :  0.0%us,  0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 :  0.0%us,  0.0%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
Mem:  16469476k total,  1993064k used, 14476412k free,   33740k buffers
Swap:  2031608k total,   185404k used,  1846204k free,  459644k cached


Verify correct bindings

# echo 0 > cpuset.mems
# echo 0-3 > cpuset.cpus
# numastat
                  node0     node1
numa_hit        1648772    438778
numa_miss         23459   2134520
local_node      1648648    423162
other_node        23583   2150136

# /common/lwoodman/code/memory 4
faulting took 1.616062s
touching took 0.364937s

# numastat
                  node0     node1
numa_hit        2700423    439550
numa_miss         23459   2134520
local_node      2700299    423934
other_node        23583   2150136


Incorrect bindings!

# echo 1 > cpuset.mems
# echo 0-3 > cpuset.cpus
# numastat
                  node0     node1
numa_hit        1623318    434106
numa_miss         23459   1082458
local_node      1623194    418490
other_node        23583   1098074

# /common/lwoodman/code/memory 4
faulting took 1.976627s
touching took 0.454322s

# numastat
                  node0     node1
numa_hit        1623341    434147
numa_miss         23459   2133738
local_node      1623217    418531
other_node        23583   2149354


JVM comparison on Red Hat – SPECjbb2013

[Chart: JVM comparison.] Source: Principled Technologies, Inc. & Red Hat, Inc., 05/28/13
