
Performance Analysis and Tuning – Part 1

Larry Woodman, Senior Consulting Engineer, RHEL/VM
Bill Gray, Principal Performance Engineer, Red Hat

June 13, 2013


Agenda: Performance Analysis and Tuning, Parts I and II

Part I
● Red Hat Enterprise Linux "tuned" profiles, top benchmark results
● Scalability – CFS scheduler tunables / Cgroups
● Hugepages – Transparent Hugepages, 2MB/1GB
● Non-Uniform Memory Access (NUMA) and NUMAD

Part II
● Network Performance and latency-performance
● Disk and Filesystem IO – throughput-performance
● System Performance/Tools – perf, tuna, systemtap
● Q & A


Red Hat Enterprise Linux: Scale Up & Out

Traditional scale-out capabilities have been complemented over the past five years with scale-up capabilities, bringing open source value and flexibility to the x86_64 server market.

● Up to 4096 CPUs
● Support for scalable architectures
● Multi-core and hyperthreading
● Kernel NUMA and SMP enhancements

[Diagram: scale up from 1 CPU to 4096 CPUs; scale out from 1 node to 1000s of nodes.]


Red Hat Enterprise Linux 6: Benchmark platform of choice
Survey of Benchmark Results, 6/2013

● Standard Performance Evaluation Corporation (SPEC) – www.spec.org
  ● SPECcpu2006
  ● SPECvirt_sc2010, sc2013
  ● SPECjbb2013
● Transaction Processing Performance Council (TPC) – www.tpc.org
  ● TPC-H: 3 of top 6 categories
  ● TPC-C: top virtualization result w/ IBM
● STAC FSI workloads – www.stacresearch.com
● SAP Sales and Distribution – www.sap.com/campaigns/benchmark


Red Hat Enterprise Linux 6: Benchmark platform of choice
SPEC Benchmark Publications 2011 – 2013

[Chart: percentage of SPEC publications using RHEL 6 (as of June 1, 2013), by year 2011–2013, for cpu2006, virt_sc2010, jEnterprise2010, virt_sc2013 and jbb2013.]

SPEC® is a registered trademark of the Standard Performance Evaluation Corporation. For more information about SPEC and its benchmarks see www.spec.org


Red Hat Enterprise Linux 6: Benchmark platform of choice
TPC Benchmark Publications 2011 – 2013

[Chart: percentage of TPC-C and TPC-H publications using RHEL 6 (as of June 1, 2013), by year 2011–2013.]

For more information about the TPC and its benchmarks see www.tpc.org.


Red Hat Enterprise Linux 6.4 vs Windows Server 2012 – LINPACK

[Chart: LINPACK comparison.] Source: Principled Technologies, Inc. & Red Hat, Inc., 05/28/13


tuned Profile Comparison Matrix

Profiles compared: default, enterprise-storage, virtual-host, virtual-guest, latency-performance, throughput-performance

● kernel.sched_min_granularity_ns: default 4ms; 10ms in the tuned profiles
● kernel.sched_wakeup_granularity_ns: default 4ms; 15ms in the tuned profiles
● vm.dirty_ratio: default 20% of RAM; 10–40% in the tuned profiles
● vm.dirty_background_ratio: default 10% of RAM; 5% where changed
● vm.swappiness: default 60; 10 or 30 where changed
● I/O scheduler (elevator): default CFQ; deadline in all tuned profiles
● Filesystem barriers: default on; off in three of the profiles
● CPU governor: default ondemand; performance in the tuned profiles
● Disk read-ahead: increased 4x
● Disable THP: yes
● Disable C-states: yes

https://access.redhat.com/site/solutions/369093
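As a quick illustration of switching between these profiles, the commands below come from the RHEL 6 tuned package (output omitted); a minimal sketch:

# tuned-adm list                            # show the available profiles
# tuned-adm profile throughput-performance  # apply a profile
# tuned-adm active                          # confirm which profile is active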


Red Hat Enterprise Linux 6 Scheduler Tunables

Implements multilevel run queues for sockets and cores (as opposed to one run queue per processor or per system).

RHEL6 tunables:
● sched_min_granularity_ns
● sched_wakeup_granularity_ns
● sched_migration_cost
● sched_child_runs_first
● sched_latency_ns

[Diagram: per-socket/per-core scheduler compute queues, with processes queued per core/thread rather than in one system-wide run queue.]


Finer grained scheduler tuning

● /proc/sys/kernel/sched_*
● The Red Hat Enterprise Linux 6 tuned-adm profiles increase the scheduling quantum to be on par with Red Hat Enterprise Linux 5:

● echo 10000000 > /proc/sys/kernel/sched_min_granularity_ns
  ● Minimal preemption granularity for CPU-bound tasks. See sched_latency_ns for details. The default value is 4000000 (ns).
● echo 15000000 > /proc/sys/kernel/sched_wakeup_granularity_ns
  ● The wake-up preemption granularity. Increasing this variable reduces wake-up preemption, reducing disturbance of compute-bound tasks. Lowering it improves wake-up latency and throughput for latency-critical tasks, particularly when a short duty cycle load component must compete with CPU-bound components. The default value is 5000000 (ns).
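The same values can also be applied with sysctl and made persistent across reboots; a minimal sketch using the sysctl names shown above and the standard /etc/sysctl.conf mechanism:

# sysctl -w kernel.sched_min_granularity_ns=10000000
# sysctl -w kernel.sched_wakeup_granularity_ns=15000000
# echo 'kernel.sched_min_granularity_ns = 10000000' >> /etc/sysctl.conf
# echo 'kernel.sched_wakeup_granularity_ns = 15000000' >> /etc/sysctl.conf
# sysctl -p     # reload settings from /etc/sysctl.conf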


Load Balancing

● Scheduler tries to keep all CPUs busy by moving tasks from overloaded CPUs to idle CPUs
● Detect using "perf stat"; look for excessive "migrations"
● /proc/sys/kernel/sched_migration_cost
  ● Amount of time after the last execution that a task is considered to be "cache hot" in migration decisions. A "hot" task is less likely to be migrated, so increasing this variable reduces task migrations. The default value is 500000 (ns).
  ● If the CPU idle time is higher than expected when there are runnable processes, try reducing this value. If tasks bounce between CPUs or nodes too often, try increasing it.
● Rule of thumb – increase by 2-10x to reduce load balancing
● Increase by 10x on large systems when many cgroups are actively used (e.g. RHEV/KVM/RHOS)
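One way to check for excessive migrations, sketched with the standard perf software counters (the 10 second window and the <pid> placeholder are just examples):

# system-wide for 10 seconds
# perf stat -e cpu-migrations,context-switches -a sleep 10

# or for a single process
# perf stat -e cpu-migrations -p <pid> sleep 10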


sched_migration_cost

RHEL6.3: effect of sched_migration_cost on fork/exit microbenchmarks
Intel Westmere EP, 24 CPUs / 12 cores, 24 GB memory

[Chart: usec/call with the default 500us value vs. tuned 4ms, and percent improvement, for exit_10, exit_100, exit_1000, fork_10, fork_100 and fork_1000 tests.]


sched_child_runs_first

● Controls fork() behavior: whether the parent or the child runs first
● Default is 0: the parent continues before its children run
● Default is different than RHEL5


Red Hat Enterprise Linux 6 Hugepages / VM Tuning

● Standard HugePages, 2MB
  ● Reserve/free via
    ● /proc/sys/vm/nr_hugepages
    ● /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages
  ● Used via hugetlbfs
● GB Hugepages, 1GB
  ● Reserved at boot time / no freeing
  ● Used via hugetlbfs
● Transparent HugePages, 2MB
  ● On by default via boot args or /sys
  ● Used for anonymous memory

[Diagram: TLB (128 data / 128 instruction entries) mapping the virtual address space to physical memory.]
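A short sketch of reserving 2MB pages on one specific NUMA node through the per-node sysfs path listed above (the count of 512 pages and the choice of node 0 are arbitrary examples):

# echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
# grep HugePages /sys/devices/system/node/node0/meminfo   # confirm the node-local pool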


2MB standard Hugepages

# echo 2000 > /proc/sys/vm/nr_hugepages
# cat /proc/meminfo
MemTotal:       16331124 kB
MemFree:        11788608 kB
HugePages_Total:    2000
HugePages_Free:     2000
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

# ./hugeshm 1000
# cat /proc/meminfo
MemTotal:       16331124 kB
MemFree:        11788608 kB
HugePages_Total:    2000
HugePages_Free:     1000
HugePages_Rsvd:     1000
HugePages_Surp:        0
Hugepagesize:       2048 kB


1GB Hugepages

Boot arguments:
● default_hugepagesz=1G hugepagesz=1G hugepages=8

# cat /proc/meminfo | more
HugePages_Total:       8
HugePages_Free:        8
HugePages_Rsvd:        0
HugePages_Surp:        0

# mount -t hugetlbfs none /mnt
# ./mmapwrite /mnt/junk 33
writing 2097152 pages of random junk to file /mnt/junk
wrote 8589934592 bytes to file /mnt/junk

# cat /proc/meminfo | more
HugePages_Total:       8
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
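When both 2MB and 1GB pools are configured, hugetlbfs can be told which page size to serve via its pagesize mount option; a minimal sketch (the mount point name is an arbitrary example):

# mkdir -p /mnt/huge1g
# mount -t hugetlbfs -o pagesize=1G none /mnt/huge1g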


Transparent Hugepages

● Boot argument: transparent_hugepage=always (enabled by default)
● Runtime control: /sys/kernel/mm/redhat_transparent_hugepage/enabled

# echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
# time ./memory 15
real    0m12.434s
user    0m0.936s
sys     0m11.416s
# cat /proc/meminfo
MemTotal:       16331124 kB
AnonHugePages:         0 kB

# echo always > /sys/kernel/mm/redhat_transparent_hugepage/enabled
# time ./memory 15GB
real    0m7.024s
user    0m0.073s
sys     0m6.847s
# cat /proc/meminfo
MemTotal:       16331124 kB
AnonHugePages:  15590528 kB

SPEEDUP 12.4/7.0 = 1.77x (runtime reduced to 56%)
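A quick sketch for checking whether THP is actually in use on a RHEL 6 system, using the RHEL 6 sysfs path shown above plus the kernel's standard counters:

# cat /sys/kernel/mm/redhat_transparent_hugepage/enabled   # bracketed value is the active mode
# grep AnonHugePages /proc/meminfo                         # anonymous memory backed by huge pages
# egrep 'thp_' /proc/vmstat                                 # thp_fault_alloc, thp_collapse_alloc, ...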


Performance – RHEL 6 / Sandy Bridge: SPECjbb Java w/ 1GB huge pages

● Sandy Bridge has 1GB hugepages – support in RHEL 5.8 and 6.2
● Before RHEL 6, only static use of hugepages
  ● Static pages wired down
  ● Need application support (DB/Java etc.)
● RHEL 6 Transparent Huge Pages
  ● Use 2MB x86_64 pages vs 4k pages
  ● Automatically used for all anonymous memory
  ● Daemon gathers free pages dynamically

[Chart: RHEL6.4 SPECjbb (sun_hotspot) bops with 2MB/1GB hugepages, Intel Sandy Bridge 16 core / 32GB – %gain over RHEL6.2 with THP disabled: 2MB HugePages +9.1%, 1GB HugePages +12.6%.]


Memory Zones

32-bit:
● DMA Zone: 0 – 16MB
● Normal Zone: 16MB – 896MB (or 3968MB)
● Highmem Zone: up to 64 GB (PAE)

64-bit:
● DMA Zone: 0 – 16MB
● DMA32 Zone: 16MB – 4GB
● Normal Zone: 4GB – end of RAM


Split LRU pagelists

● Separate page lists for anonymous and pagecache pages
● Prevents mixing of anonymous and file-backed pages on the active and inactive LRU lists
● Eliminates long pauses when all CPUs enter direct reclaim during memory exhaustion
● Prevents swapping when copying very large files
● Prevents swapping of database cache during backup.
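The split lists can be observed directly in /proc/meminfo; a minimal sketch of how to watch the anonymous vs. file-backed LRU sizes change under load:

# grep -E '^(Active|Inactive)' /proc/meminfo   # Active(anon)/(file), Inactive(anon)/(file)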


Per Node/Zone split LRU Paging Dynamics

[Diagram: user allocations land on the active anon/file LRU lists; page aging moves pages to the inactive lists; reclaim swaps out anonymous pages and flushes/frees file pages back to the free lists; referenced pages are reactivated onto the active lists; user deletions free pages directly.]


What is NUMA?

● Non Uniform Memory Access
● A result of making bigger systems more scalable by distributing system memory near individual CPUs....
● All multi-socket x86_64 server systems are NUMA
  ● Most servers have 1 NUMA node / socket
  ● Recent AMD systems have 2 NUMA nodes / socket
● Keep interleave memory in BIOS off (default)
  ● Else the OS will see only 1 NUMA node!!!


Typical System Building Block

[Diagram: one NUMA node – four cores (Core 0–3) sharing an L3 cache, attached to a memory controller with node-local RAM, plus QPI links, IO, etc.]


Two NUMA node system

[Diagram: Node 0 and Node 1, each with four cores sharing an L3 cache and node-local RAM, connected by QPI links, IO, etc.]


Four NUMA node system, fully-connected topology

[Diagram: Nodes 0–3, each with four cores, an L3 cache and node-local RAM; QPI links connect every node directly to every other node.]


Four NUMA node system, ring topology

[Diagram: Nodes 0–3, each with four cores, an L3 cache and node-local RAM; QPI links connect the nodes in a ring rather than all-to-all.]


Per NUMA-Node Resources

● Memory zones (DMA & Normal zones)
● CPUs
● IO/DMA capacity
● Interrupt processing
● Page reclamation kernel thread (kswapd#)
● Lots of other kernel threads


NUMA Nodes and Zones (64-bit)

[Diagram: Node 0 holds the DMA zone (0–16MB), the DMA32 zone (16MB–4GB) and a Normal zone; Node 1 holds only a Normal zone extending to the end of RAM.]


zone_reclaim_mode

● Controls NUMA specific memory allocation policy
● When set and node memory is exhausted:
  ● Reclaim memory from the local node rather than allocating from the next node
  ● Slower allocation, higher NUMA hit ratio
● When clear and node memory is exhausted:
  ● Allocate from all nodes before reclaiming memory
  ● Faster allocation, higher NUMA miss ratio
● Default is set at boot time based on NUMA factor
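A minimal sketch for inspecting and changing the setting at runtime (vm.zone_reclaim_mode is the standard sysctl name for this file):

# cat /proc/sys/vm/zone_reclaim_mode        # or: sysctl vm.zone_reclaim_mode
# echo 0 > /proc/sys/vm/zone_reclaim_mode   # clear: allocate from other nodes before reclaiming
# echo 1 > /proc/sys/vm/zone_reclaim_mode   # set: reclaim from the local node first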


Learn about CPUs via lscpu

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    1
Core(s) per socket:    10
CPU socket(s):         4
NUMA node(s):          4
. . . .
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0,4,8,12,16,20,24,28,32,36
NUMA node1 CPU(s):     2,6,10,14,18,22,26,30,34,38
NUMA node2 CPU(s):     1,5,9,13,17,21,25,29,33,37
NUMA node3 CPU(s):     3,7,11,15,19,23,27,31,35,39


Visualize CPUs via lstopo (from the hwloc package)

# lstopo

[Diagram: lstopo graphical output showing the machine's NUMA nodes, caches, cores and threads.]


Learn NUMA layout via numactl

# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 4 8 12 16 20 24 28 32 36
node 0 size: 65415 MB
node 0 free: 63482 MB
node 1 cpus: 2 6 10 14 18 22 26 30 34 38
node 1 size: 65536 MB
node 1 free: 63968 MB
node 2 cpus: 1 5 9 13 17 21 25 29 33 37
node 2 size: 65536 MB
node 2 free: 63897 MB
node 3 cpus: 3 7 11 15 19 23 27 31 35 39
node 3 size: 65536 MB
node 3 free: 63971 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10


Sample remote access latencies (relative to local access)

● 4 socket / 4 node: 1.5x
● 4 socket / 8 node: 2.7x
● 8 socket / 8 node: 2.8x
● 32 node system: 5.5x
  ● (30/32 inter-node latencies >= 4x)
  ● Node distance values (count of the 1024 table entries): 10 (32: 3.1%), 13 (32: 3.1%), 40 (64: 6.2%), 48 (448: 43.8%), 55 (448: 43.8%)


Red Hat Enterprise Linux 6.4: SPECjbb2005 w/ numactl

[Chart: RHEL6.4 SPECjbb (OpenJDK) bops, Intel Westmere EX, 40 core, 4 socket, 256 GB, four instances (inst1–inst4); bare metal vs. KVM, default vs. numactl binding.]


So, what's the NUMA problem?

● The Linux system scheduler is very good at maintaining responsiveness and optimizing for CPU utilization
● It tries to use idle CPUs, regardless of where process memory is located.... Using remote memory degrades performance!
  ● Red Hat is working with the upstream community to increase NUMA awareness of the scheduler and to implement automatic NUMA balancing.
● Remote memory latency matters most for long-running, significant processes, e.g., HPTC, VMs, etc.


Use numastat to see memory layout

● Rewritten for Red Hat Enterprise Linux 6.4 to show per-node system and process memory information
● 100% compatible with prior version by default, displaying /sys...node/numastat memory allocation statistics
● Any command options invoke new functionality
  ● -m for per-node system memory info
  ● -p <PID|pattern> for per-node process memory info
● See numastat(8)


numastat: compatibility mode

# numastat
                  node0     node1     node2     node3
numa_hit        1655286    266159    314693    273846
numa_miss             0      2790         0         0
numa_foreign       2790         0         0         0
interleave_hit    14365     14354     14366     14348
local_node      1652364    249938    298463    257638
other_node         2922     19011     16230     16208

                  node4     node5     node6     node7
numa_hit         252059    529980    240696    375607
numa_miss             0         0         0         0
numa_foreign          0         0         0         0
interleave_hit    14367     14336     14333     14388
local_node       235903    513789    224511    361928
other_node        16156     16191     16185     13679


numastat: compressed display

# numastat -c

Per-node numastat info (in MBs):
                Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7  Total
                ------ ------ ------ ------ ------ ------ ------ ------ ------
Numa_Hit          6479   1040   1230   1070    985   2070    941   1468  15284
Numa_Miss            0     11      0      0      0      0      0      0     11
Numa_Foreign        11      0      0      0      0      0      0      0     11
Interleave_Hit      56     56     56     56     56     56     56     56    449
Local_Node        6468    977   1166   1007    922   2007    877   1415  14839
Other_Node          11     74     63     63     63     63     63     53    455


numastat: per-node meminfo

# numastat -mczs

Per-node system memory usage (in MBs):
                Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 Node 7  Total
                ------ ------ ------ ------ ------ ------ ------ ------ ------
MemTotal         32766  32768  32768  32768  32768  32768  32768  32752 262126
MemFree          31863  31965  32120  32086  32098  32080  32114  32062 256388
MemUsed            903    803    648    682    670    688    654    690   5738
FilePages           11     26      8     37     21     18      9     45    176
Slab                25     16      7     10     12     36     10     10    126
Active               5     13      4     25     10      9      6     41    113
Active(file)         4     11      3     23      8      6      3     40     99
SUnreclaim          19     10      6      6      9     33      7      7     97
Inactive             7     15      4     14     12     12      6      6     76
Inactive(file)       7     15      4     14     12     12      6      6     76
SReclaimable         7      6      2      4      3      3      3      2     29
Active(anon)         2      1      1      2      2      2      3      2     14
AnonPages            2      1      1      2      2      2      3      2     14
Mapped               0      0      0      1      4      3      1      1     11
KernelStack          9      0      0      0      0      0      0      0     10
PageTables           0      0      0      0      1      1      0      1      3
Shmem                0      0      0      0      0      0      0      0      0
Inactive(anon)       0      0      0      0      0      0      0      0      0


numastat shows unaligned guests

# numastat -c qemu

Per-node process memory usage (in MBs)
PID               Node 0 Node 1 Node 2 Node 3  Total
---------------   ------ ------ ------ ------ ------
10587 (qemu-kvm)    1216   4022   4028   1456  10722
10629 (qemu-kvm)    2108     56    473   8077  10714
10671 (qemu-kvm)    4096   3470   3036    110  10712
10713 (qemu-kvm)    4043   3498   2135   1055  10730
---------------   ------ ------ ------ ------ ------
Total              11462  11045   9672  10698  42877


numastat shows aligned guests

# numastat -c qemu

Per-node process memory usage (in MBs)
PID               Node 0 Node 1 Node 2 Node 3  Total
---------------   ------ ------ ------ ------ ------
10587 (qemu-kvm)       0  10723      5      0  10728
10629 (qemu-kvm)       0      0      5  10717  10722
10671 (qemu-kvm)       0      0  10726      0  10726
10713 (qemu-kvm)   10733      0      5      0  10738
---------------   ------ ------ ------ ------ ------
Total              10733  10723  10740  10717  42913


Some KVM NUMA Suggestions

● Don't assign extra resources to guests
  ● Don't assign more memory than can be used
  ● Don't make guests unnecessarily wide
    ● Not much point to more VCPUs than application threads
● For best NUMA affinity and performance, the number of guest VCPUs should be small enough to fit within a single NUMA node


How to manage NUMA manually

● Research the NUMA topology of each system
● Make a resource plan for each system
● Bind both CPUs and memory
  ● Might also consider devices and IRQs
● Use numactl for native jobs:
  ● numactl -N <nodes> -m <nodes> <workload>
● Use numatune for libvirt-started guests
  ● Edit the guest XML: add a <numatune> element
● Use cgroups w/ apps to bind cpu/mem to NUMA nodes
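A minimal sketch of both approaches; the node number, guest name and application path are arbitrary examples, and the <numatune> element follows the standard libvirt domain XML syntax:

# run a native job with CPUs and memory confined to node 1
# numactl -N 1 -m 1 /path/to/myapp

# for a libvirt guest, edit the domain XML (virsh edit <guest>) and add:
#   <numatune>
#     <memory mode='strict' nodeset='1'/>
#   </numatune>
# and optionally pin the VCPUs to that node's CPUs, e.g.:
#   <vcpu placement='static' cpuset='2,6,10,14'>4</vcpu>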


Resource Management using cgroups

● Ability to manage large system resources effectively
● Control Groups (cgroups) for CPU/Memory/Network/Disk
● Benefit: guarantee Quality of Service & dynamic resource allocation
● Ideal for managing any multi-application environment, from back-ups to the Cloud


numad can help improve NUMA performance

● New Red Hat Enterprise Linux 6.4 user-level daemon to automatically improve out-of-the-box NUMA system performance, and to balance NUMA usage in dynamic workload environments
  ● Was tech-preview in Red Hat Enterprise Linux 6.3
● Improves NUMA performance for some workloads
● Not enabled by default
● See numad(8)
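Since it is not enabled by default, a quick sketch of turning it on as a service, assuming the numad package and its standard RHEL 6 init script:

# yum install numad
# service numad start
# chkconfig numad on     # start automatically at boot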


numad matches resource consumers with available resources

[Diagram: a node scanner tracks available CPUs and memory per node; a process scanner tracks required CPUs and memory per process; the numad picker matches consumed against available resources and produces a node list for each process.]


numad aligns process memory and CPU threads within nodes

[Diagram: before numad, processes 19, 29, 37 and 61 are spread across nodes 0–3; after numad, each process is consolidated onto a single node.]


Numad - aligning memory <strong>and</strong> threads in nodes:<strong>Red</strong>uces memory latency, improves determinism


numad usage

● numad is intended primarily for server consolidation environments
  ● Multiple applications running on the same server
  ● Multiple instances of the same application
  ● Multiple virtual guests
● numad is most likely to have a positive effect when processes can be localized in a fractional subset of the system's NUMA nodes.
● If the entire system is dedicated to a large in-memory database application, for example -- especially if memory accesses will likely remain unpredictable -- numad will probably not improve performance.
● Similarly, very high bandwidth applications -- that really need all the system memory controllers -- will likely not benefit from localization


Start, stop numad, and set interval

● # numad -i [<min>:]<max>  to specify the scan interval in seconds
● # numad -i 0              to terminate the numad daemon
● Default is "-i 5:15"
● Increasing the max interval will decrease overhead -- but will also decrease responsiveness to changing loads.
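For example, restating the defaults above as explicit commands:

# numad -i 5:15     # run with a 5 second minimum / 15 second maximum scan interval
# numad -i 0        # stop the running numad daemon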


To change the utilization target

● -u <percent> to specify the target utilization percent
● Default is "-u 85"
● Increase the utilization target to more fully utilize the resources on each node
● Decrease the utilization target to maintain more per-node resource margin for bursty loads
  ● Could also decrease it to force processes across multiple nodes


Bare Metal - Java Workload: Automatic numad Improvement

[Chart: multi-instance Java workload (BOPs vs. warehouses) on a 4 socket, 8 node system – default vs. numad vs. numactl binding, with percent gain over default.]


To get pre-placement advice

● numad -w <ncpus>[:<MB>]  for node suggestions
● Output is a recommended node list, e.g., "1-2,4"
● Can be used regardless of whether numad is running as a daemon
  ● Will take a couple seconds if not running
● Used by libvirt for optional VM auto placement
● Could be used in a shell script for automated job placement (see the script on the following slides)
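A short sketch of asking for advice and acting on it; the resource request (6 CPUs, 8192 MB) and job name are arbitrary examples, and the memory argument is assumed to be in MB per numad(8):

# numad -w 6:8192              # ask where a job needing 6 CPUs and 8 GB should run
1-2,4                          # numad's recommended node list
# numactl -N 1-2,4 -m 1-2,4 ./myjob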


numad "-w" shell script example

#!/bin/bash
PROCESSES=$1; shift
THREADS=$1; shift
GIGABYTES=$1; shift
echo "Trying $PROCESSES fake 'guests' with $THREADS VCPUs and $GIGABYTES GB each."
echo "Note average work accomplished -- displayed in a few minutes."
for (( i=1; i<=$PROCESSES; i++ )); do


numad "-w" shell script (the important part)

for (( i=1; i<=$PROCESSES; i++ )); do
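The body of the loop is cut off on these slides. Based on the commands visible in the output on the next two slides (numad -w advice, the pig test tool, and numactl -N/-m binding), a plausible reconstruction -- purely illustrative, with the pig tool path and options copied from that output -- might look like:

for (( i=1; i<=$PROCESSES; i++ )); do
    # ask numad where a job of this size should run
    NODES=`numad -w $THREADS:$((GIGABYTES * 1024))`
    echo "numad advises to use nodes: $NODES"
    # start a fake 'guest' bound to the advised nodes
    numactl -N $NODES -m $NODES ../pig_tool/pig -t $THREADS -gm $((GIGABYTES * 1000)) -s 60 -l mem &
done
echo "Sleeping while the fake guests finish up..."
wait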


numad "-w" shell script (ignorant)

# ./pig_place_test.sh 5 6 7
Trying 5 fake 'guests' with 6 VCPUs and 7 GB each.
Note average work accomplished -- displayed in a few minutes.
numad advises to use nodes: 2 -- but ignoring that and not binding...
../pig_tool/pig -t 6 -gm 7000 -s 60 -l mem
numad advises to use nodes: 7 -- but ignoring that and not binding...
../pig_tool/pig -t 6 -gm 7000 -s 60 -l mem
numad advises to use nodes: 3 -- but ignoring that and not binding...
../pig_tool/pig -t 6 -gm 7000 -s 60 -l mem
numad advises to use nodes: 3 -- but ignoring that and not binding...
../pig_tool/pig -t 6 -gm 7000 -s 60 -l mem
numad advises to use nodes: 3 -- but ignoring that and not binding...
../pig_tool/pig -t 6 -gm 7000 -s 60 -l mem
Sleeping while the fake guests finish up...
Threads: 6 Avg: 39.5 Stddev: 7.6  Min: 33 Max: 54
Threads: 6 Avg: 35.8 Stddev: 5.5  Min: 31 Max: 44
Threads: 6 Avg: 39.7 Stddev: 4.2  Min: 35 Max: 45
Threads: 6 Avg: 49.8 Stddev: 12.5 Min: 33 Max: 62
Threads: 6 Avg: 62.3 Stddev: 13.3 Min: 49 Max: 80


numad "-w" shell script (advised)

OK, now trying same size fake 'guests' using numad placement advice.
Average work accomplished should be higher, stddev might be better too.
numad advises to use nodes: 2
numactl -N 2 -m 2 ../pig_tool/pig -t 6 -gm 7000 -s 60 -l mem
numad advises to use nodes: 1
numactl -N 1 -m 1 ../pig_tool/pig -t 6 -gm 7000 -s 60 -l mem
numad advises to use nodes: 3
numactl -N 3 -m 3 ../pig_tool/pig -t 6 -gm 7000 -s 60 -l mem
numad advises to use nodes: 7
numactl -N 7 -m 7 ../pig_tool/pig -t 6 -gm 7000 -s 60 -l mem
numad advises to use nodes: 6
numactl -N 6 -m 6 ../pig_tool/pig -t 6 -gm 7000 -s 60 -l mem
Sleeping while the fake guests finish up...
Threads: 6 Avg: 105.0 Stddev: 0.0 Min: 105 Max: 105
Threads: 6 Avg: 106.0 Stddev: 0.0 Min: 106 Max: 106
Threads: 6 Avg: 106.0 Stddev: 0.0 Min: 106 Max: 106
Threads: 6 Avg: 105.0 Stddev: 0.0 Min: 105 Max: 105
Threads: 6 Avg: 104.0 Stddev: 0.0 Min: 104 Max: 104


Multiguest - KVM Java Workload

Eight KVM guests with Java load; numad with RHEL6.4 host and 6.3 guests

[Chart: BOPS and %gain vs. warehouses (wh2, wh4, wh6, wh8) for default, numad, and numatune configurations.]


Multiguest Oracle OLTP Workload

Oracle OLTP in TPM, 4 KVM guests; Intel 32-cpu, 128 GB, 2 FC

[Chart: TPM and %gain at 20, 40 and 80 users for default, numad, and NUMA-pinned configurations.]


numad future

● Shipping in Red Hat Enterprise Linux 6.4
● Potential future improvements: device and IRQ affinity, related process hints
● Future TBD pending upstream kernel efforts
  ● Perhaps complementary NUMA management roles, as systems will continue to grow in size and complexity


Summary / Questions

● Red Hat Enterprise Linux 6 Performance Features
  ● "tuned" tool – adjusts system parameters to match environments - throughput/latency.
  ● Transparent Huge Pages – auto select large pages for anonymous memory, static hugepages for shared mem
  ● Non-Uniform Memory Access (NUMA)
    ● numastat enhancements
    ● numactl for manual control
    ● numad daemon for auto placement
  ● TUNA – integration w/ Red Hat Enterprise Linux 6.4
● (...Come back for part 2...)


cgroups Architecture


Cgroup default mount points

# cat /etc/cgconfig.conf
mount {
    cpuset  = /cgroup/cpuset;
    cpu     = /cgroup/cpu;
    cpuacct = /cgroup/cpuacct;
    memory  = /cgroup/memory;
    devices = /cgroup/devices;
    freezer = /cgroup/freezer;
    net_cls = /cgroup/net_cls;
    blkio   = /cgroup/blkio;
}

# ls -l /cgroup
drwxr-xr-x 2 root root 0 Jun 21 13:33 blkio
drwxr-xr-x 3 root root 0 Jun 21 13:33 cpu
drwxr-xr-x 3 root root 0 Jun 21 13:33 cpuacct
drwxr-xr-x 3 root root 0 Jun 21 13:33 cpuset
drwxr-xr-x 3 root root 0 Jun 21 13:33 devices
drwxr-xr-x 3 root root 0 Jun 21 13:33 freezer
drwxr-xr-x 3 root root 0 Jun 21 13:33 memory
drwxr-xr-x 2 root root 0 Jun 21 13:33 net_cls


Cgroup how-to

Create a 1GB/2CPU subset of a 16GB/8CPU system:

# numactl --hardware
# mount -t cgroup xxx /cgroups
# mkdir -p /cgroups/test
# cd /cgroups/test
# echo 1 > cpuset.mems
# echo 2-3 > cpuset.cpus
# echo 1G > memory.limit_in_bytes
# echo $$ > tasks
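The same cgroup can also be managed with the libcgroup userspace tools shipped in RHEL 6; a rough equivalent of the steps above (the group name "test" and the application path are just example names):

# cgcreate -g cpuset,memory:/test
# cgset -r cpuset.mems=1 test
# cgset -r cpuset.cpus=2-3 test
# cgset -r memory.limit_in_bytes=1073741824 test   # 1 GB
# cgexec -g cpuset,memory:test /path/to/myapp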


cgroups

[root@dhcp-100-19-50 ~]# forkmany 20MB 100procs &
[root@dhcp-100-19-50 ~]# top -d 5
top - 12:24:13 up 1:36, 4 users, load average: 22.70, 5.32, 1.79
Tasks: 315 total, 93 running, 222 sleeping, 0 stopped, 0 zombie
Cpu0 :  0.0%us,  0.2%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 :  0.0%us,  0.2%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 :100.0%us,  0.0%sy, 0.0%ni,  0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 89.6%us, 10.0%sy, 0.0%ni,  0.0%id, 0.0%wa, 0.2%hi, 0.2%si, 0.0%st
Cpu4 :  0.4%us,  0.6%sy, 0.0%ni, 98.8%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
Cpu5 :  0.4%us,  0.0%sy, 0.0%ni, 99.2%id, 0.0%wa, 0.0%hi, 0.4%si, 0.0%st
Cpu6 :  0.0%us,  0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 :  0.0%us,  0.0%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
Mem:  16469476k total,  1993064k used, 14476412k free,   33740k buffers
Swap:  2031608k total,   185404k used,  1846204k free,  459644k cached


Verify correct bindings

# echo 0 > cpuset.mems
# echo 0-3 > cpuset.cpus
# numastat
                  node0     node1
numa_hit        1648772    438778
numa_miss         23459   2134520
local_node      1648648    423162
other_node        23583   2150136

# /common/lwoodman/code/memory 4
faulting took 1.616062s
touching took 0.364937s

# numastat
                  node0     node1
numa_hit        2700423    439550
numa_miss         23459   2134520
local_node      2700299    423934
other_node        23583   2150136


Incorrect bindings!

# echo 1 > cpuset.mems
# echo 0-3 > cpuset.cpus
# numastat
                  node0     node1
numa_hit        1623318    434106
numa_miss         23459   1082458
local_node      1623194    418490
other_node        23583   1098074

# /common/lwoodman/code/memory 4
faulting took 1.976627s
touching took 0.454322s

# numastat
                  node0     node1
numa_hit        1623341    434147
numa_miss         23459   2133738
local_node      1623217    418531
other_node        23583   2149354


JVM comparison on Red Hat – SPECjbb2013

[Chart: JVM comparison.] Source: Principled Technologies, Inc. & Red Hat, Inc., 05/28/13
