HPCToolkit:
A Tool for Performance Analysis
on Heterogeneous Supercomputers
Milind Chabbi, Karthik Murthy,
Mike Fagan, and John Mellor-Crummey
Rice University
GTC 2013
San Jose
March 21, 2013

Architectural Shift in Supercomputing
18K 16-core CPUs + 18K Tesla K20 GPUs
Milind Chabbi
HPCToolkit: A Tool for Performance Analysis on Heterogeneous Supercomputers

Architectural Shift in Supercomputing
Keeneland (KFS) at University of Tennessee
★ 264 nodes, each with two 8-core Xeon E5 CPUs + 3 Nvidia Tesla M2090 GPUs

Challenge: Hybrid Applications
Migrating scientific applications to heterogeneous systems
✦ Molecular dynamics
  ★ LAMMPS
  ★ NAMD
✦ Hydrodynamics
  ★ LULESH
✦ Lattice field theory
  ★ Chroma
These codes have a highly tuned CPU part and an emerging GPU part.
Performance analysis tools play a vital role in tuning these applications.

Project Goals
• Provide performance analysis tools for emerging heterogeneous supercomputers
  ✦ GPU kernels alone → whole application performance analysis
• Provide performance improvement “insights”
  ✦ Opportunities for tuning, rather than resource consumption
• Desired properties of the tool
  ✦ Introduce minimal execution perturbation
  ✦ Collect compact measurement data at scale
  ✦ Provide language / programming model independence

Accelerated LAMMPS
Molecular dynamics code LAMMPS has two accelerated versions:
LAMMPS-CUDA
  ✦ Offloads many timesteps of force calculation to GPUs
  ✦ 3 GPUs busy; 3 CPU cores idle and 13 CPU cores unused
  ✦ CPU and GPU computations are not overlapped
LAMMPS-GPU
  ✦ Atom data moves in and out of GPU each timestep
  ✦ Emerging support for CPU-GPU load balancing

LAMMPS-CUDA on Keeneland
[Figure: execution timeline; rows: CPU (“Mostly Idle”), GPU STREAM, GPU STREAM; x-axis: Time]

Idleness Wastes Compute Power
• Offloading entire computation to GPUs wastes CPU
• Performing entire computation on CPUs wastes GPU
Better approach:
• Overlap CPU and GPU computation
  ✦ Divide principal computation between CPU and GPU
  ✦ Use CPU to prepare next work while GPU is busy

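The payoff of overlapping can be illustrated with a little arithmetic. The sketch below is a minimal model, not anything taken from the talk: the function names, the throughput figures, and the assumption that work divides perfectly between devices are all hypothetical.

```python
def step_time_serial(cpu_s, gpu_s):
    """No overlap: each device sits idle while the other one runs."""
    return cpu_s + gpu_s


def step_time_overlapped(cpu_s, gpu_s):
    """Full overlap: the step takes as long as the slower device."""
    return max(cpu_s, gpu_s)


def balanced_split(cpu_rate, gpu_rate):
    """Fraction of divisible work to give the CPU so both finish together."""
    return cpu_rate / (cpu_rate + gpu_rate)


# Hypothetical numbers: 100 units of work, CPU at 20 units/s, GPU at 80 units/s.
work, cpu_rate, gpu_rate = 100.0, 20.0, 80.0

gpu_only_time = work / gpu_rate              # all work offloaded, CPU idle
f = balanced_split(cpu_rate, gpu_rate)       # share of work kept on the CPU
split_time = step_time_overlapped(f * work / cpu_rate,
                                  (1.0 - f) * work / gpu_rate)

print(f"GPU only: {gpu_only_time:.2f} s, balanced split: {split_time:.2f} s")
```

Under these assumed rates the balanced split finishes sooner than offloading everything, which is the point of the slide: an idle device is unused throughput.
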
Focus on Resource (under) Utilization
• To achieve high performance on heterogeneous systems, applications should exploit all compute resources
• Problem: “hotspot analysis” of today’s performance analysis tools is insufficient for tuning
  ✦ Focuses on “most consumed” resources
  ✦ Identifies only symptoms of performance problems
  ✦ Does not identify causes of problems

Our Methodology: Pinpoint and Quantify Idleness
• Quantify resource idleness
• Attribute idleness to its causes
  ✦ If GPU is idle → blame CPU code for not offloading (enough) work
  ✦ If CPU is waiting for results from GPU → blame GPU kernel(s) involved
a.k.a. “CPU-GPU Blame Shifting”

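The two attribution rules above can be sketched as a small function over a list of samples. This is an illustration only, not HPCToolkit's implementation: the `Sample` record, the context labels, and the choice to divide blame equally among concurrently running kernels are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    cpu_context: str            # hypothetical label for the CPU call path
    cpu_waiting: bool           # True if the CPU is blocked waiting on the GPU
    active_kernels: tuple = ()  # kernels running on the GPU at this instant


def shift_blame(samples):
    """Attribute each idle sample to the code that caused the idleness."""
    blame = Counter()
    for s in samples:
        if s.cpu_waiting and s.active_kernels:
            # CPU idle, GPU busy: blame the kernels being waited on
            # (equal split among them is an assumption of this sketch).
            share = 1.0 / len(s.active_kernels)
            for kernel in s.active_kernels:
                blame[("gpu_kernel", kernel)] += share
        elif not s.cpu_waiting and not s.active_kernels:
            # GPU idle, CPU busy: blame the CPU code that is not offloading.
            blame[("cpu_code", s.cpu_context)] += 1.0
        # Both busy: resources are overlapped, so nothing is blamed.
    return blame


samples = [
    Sample("pack_atoms", cpu_waiting=False),                    # GPU idle
    Sample("pack_atoms", cpu_waiting=False),                    # GPU idle
    Sample("sync", cpu_waiting=True, active_kernels=("force_kernel",)),
    Sample("integrate", cpu_waiting=False, active_kernels=("force_kernel",)),
]
blame = shift_blame(samples)
```

Here `pack_atoms` collects two samples of blame for leaving the GPU idle, `force_kernel` collects one for keeping the CPU waiting, and the fully overlapped sample is charged to no one.
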
Quantifying Tuning Opportunities with CPU-GPU Blame Shifting - 1
[Figure: timeline; CPU row: WORK, SYNC (5% idle), WORK, SYNC (40% idle); GPU row: Kernel A, Kernel B; x-axis: Time]
• Hotspot analysis → Kernel A: 5% expected gain by tuning Kernel A
• Blame shifting → Kernel B: 40% expected gain by tuning Kernel B
Top GPU-kernel may not be the best candidate for tuning

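The contrast between the two rankings can be reproduced with interval arithmetic. In the sketch below only the 5% and 40% idle figures come from the slide; the kernel start and end times, the CPU wait intervals, and the function name are hypothetical values chosen so that the run lasts 100 time units.

```python
def cpu_idle_blame(kernels, cpu_waits):
    """Charge each kernel with the CPU wait time that overlaps its execution."""
    blame = {}
    for name, (k_start, k_end) in kernels.items():
        blame[name] = sum(
            max(0, min(k_end, w_end) - max(k_start, w_start))
            for w_start, w_end in cpu_waits
        )
    return blame


# Hypothetical timeline, 100 time units long.
kernels = {"Kernel A": (0, 55), "Kernel B": (55, 100)}
cpu_waits = [(50, 55), (60, 100)]   # CPU blocked in SYNC: 5 units, then 40 units

duration = {name: end - start for name, (start, end) in kernels.items()}
blame = cpu_idle_blame(kernels, cpu_waits)

hotspot_rank = sorted(kernels, key=lambda k: duration[k], reverse=True)
blame_rank = sorted(kernels, key=lambda k: blame[k], reverse=True)

print("hotspot order:", hotspot_rank)   # ['Kernel A', 'Kernel B']
print("blame order:  ", blame_rank)     # ['Kernel B', 'Kernel A']
```

Kernel A runs longer, so a hotspot view lists it first, yet most of its execution is hidden behind CPU work. Kernel B is shorter but accounts for 40 of the 45 idle units, so it is the better tuning target in this constructed case.
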
Hotspot vs. Blame Shifting on LULESH
Rank  Hotspot                                      Blame Shifting
1     CalcFBHourglassForceForElems                 CalcFBHourglassForceForElems
2     CalcKinematicsForElems                       CalcHourglassControlForElems
3     CalcMonotonicQGradientsForElems              IntegrateStressForElems
4     CalcHourglassControlForElems                 AddNodeForcesFromElems2
5     IntegrateStressForElems                      AddNodeForcesFromElems
6     AddNodeForcesFromElems2                      CalcKinematicsForElems
7     AddNodeForcesFromElems                       EvalEOSForElemsPart2
8     CalcPressureForElems                         CalcSoundSpeedForElems
9     CalcMonotonicQRegionForElems                 CalcLagrangeElementsPart2
10    EvalEOSForElemsPart1                         InitStressTermsForElems
11    CalcEnergyForElemsPart2                      CalcCourantConstraintForElems
12    CalcEnergyForElemsPart3                      CalcHydroConstraintForElems
13    CalcPositionForNodes                         UpdateVolumesForElems
14    CalcVelocityForNodes                         CalcMonotonicQGradientsForElems
15    CalcEnergyForElemsPart4                      EvalEOSForElemsPart1
16    CalcAccelerationForNodes                     CalcMonotonicQRegionForElems
17    CalcLagrangeElementsPart2                    CalcPressureForElems
18    EvalEOSForElemsPart2                         CalcEnergyForElemsPart2
19    CalcSoundSpeedForElems                       CalcEnergyForElemsPart3
20    CalcEnergyForElemsPart1                      CalcEnergyForElemsPart4
21    InitStressTermsForElems                      ApplyMaterialPropertiesForElemsPart1
22    CalcCourantConstraintForElems                CalcEnergyForElemsPart1
23    CalcHydroConstraintForElems                  CalcPositionForNodes
24    ApplyMaterialPropertiesForElemsPart1         CalcVelocityForNodes
25    UpdateVolumesForElems                        CalcAccelerationForNodes
26    ApplyAccelerationBoundaryConditionsForNodes  ApplyAccelerationBoundaryConditionsForNodes

Quantifying Tuning Opportunities with CPU-GPU Blame Shifting - 2
Vice versa is also true
[Figure: timeline; CPU row: CPU PART A, SYNC, CPU PART B, SYNC; GPU row: KernelA, GPU IDLE, KernelB; x-axis: Time]
Tuning “CPU PART A” reduces the critical path

HPCToolkit
Performance tool for large parallel systems
• Supports multilingual, fully optimized, statically or dynamically linked applications (no source modification)
• Measures performance using asynchronous sampling of timers and hardware performance counters
  ✦ Low overhead (under 5%) for both profiling and tracing
• Attributes performance to full call paths
  ✦ Pthread, OpenMP, MPI, and any combination
• Decentralized → Scales to 1000s of nodes
• Provides GUIs for code- and time-centric analysis

Enhancing HPCToolkit for Heterogeneous Supercomputers
• Problem: GPUs (natively) don’t support sampling
• Approach: Instrument GPU runtime APIs to monitor asynchronous activities via cudaEvents; use CPU events as sampling points to interrogate GPU
• CPU-GPU blame shifting
  ✦ Attributes blame to causes rather than symptoms on the fly
  ✦ Delivers deep insight without recording activity traces
• Support for full calling context on GPUs to distinguish same kernels launched from different calling contexts

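The approach above can be mimicked in a toy simulation: each kernel launch is bracketed by a start and an end event, and every CPU sample polls those events to learn whether the GPU is busy. Everything in this sketch is a stand-in. `FakeEvent` plays the role of a cudaEvent polled with `cudaEventQuery`, and the class names, the context string, and the timings are hypothetical rather than HPCToolkit internals.

```python
class FakeEvent:
    """Stand-in for a cudaEvent: complete once simulated time reaches it."""

    def __init__(self, fires_at):
        self.fires_at = fires_at

    def is_complete(self, now):
        return now >= self.fires_at


class KernelRecord:
    """One kernel launch, bracketed by events and tagged with its CPU context."""

    def __init__(self, name, launch_context, start, end):
        self.name = name
        self.launch_context = launch_context   # CPU call path at launch
        self.start_event = FakeEvent(start)
        self.end_event = FakeEvent(end)
        self.blame = 0.0


def on_cpu_sample(now, cpu_waiting, outstanding):
    """Runs at each CPU sample: poll events, then blame running kernels."""
    # Retire kernels whose end event has completed.
    outstanding[:] = [k for k in outstanding if not k.end_event.is_complete(now)]
    running = [k for k in outstanding if k.start_event.is_complete(now)]
    if cpu_waiting:
        for k in running:
            k.blame += 1.0 / len(running)
    return running


# Hypothetical run: one kernel executes for t in [0, 100); the CPU does
# useful work until t = 60 and then blocks in a synchronize call.
kernel = KernelRecord("force_kernel", "main>compute_forces", start=0, end=100)
outstanding = [kernel]
overlap_samples = 0
for t in range(100):                 # one CPU sample per time unit
    cpu_waiting = t >= 60
    running = on_cpu_sample(t, cpu_waiting, outstanding)
    if running and not cpu_waiting:
        overlap_samples += 1

print(overlap_samples, kernel.blame)   # 60 overlapped samples, 40.0 blamed
```

No trace of GPU activity is stored; the only state is the list of outstanding kernel records, and blame accumulates on the record (and therefore on its launch context) as samples arrive.
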
CPU-GPU Blame Shifting in Action in HPCToolkit
[Figure, built up over several slides: timeline with a CPU thread row and a GPU stream row; x-axis: Time. CPU thread row: event, Kernel, event, Overlap, cudaDeviceSynchronize() (CPU IDLE), with EventQuery marks along the row. GPU stream row: event, Kernel Execution, GPU ONLY, with a “Blame” annotation. The time axis is annotated 60% and 40%.]
e<br />
v<br />
e<br />
n<br />
t<br />
Kernel Executi<strong>on</strong><br />
GPU ONLY<br />
Blame<br />
e<br />
v<br />
e<br />
n<br />
t<br />
stream<br />
60%<br />
Time<br />
40%<br />
Milind Chabbi<br />
<str<strong>on</strong>g>HPCToolkit</str<strong>on</strong>g>: A Tool for <str<strong>on</strong>g>Performance</str<strong>on</strong>g> <str<strong>on</strong>g>Analysis</str<strong>on</strong>g> <strong>on</strong> <strong>Heterogeneous</strong> Supercomputers
CPU-GPU Blame Shifting in Acti<strong>on</strong> in<br />
<str<strong>on</strong>g>HPCToolkit</str<strong>on</strong>g><br />
CPU<br />
hread<br />
e<br />
v<br />
e<br />
n<br />
t<br />
K<br />
e<br />
r<br />
n<br />
e<br />
l<br />
e<br />
v<br />
e<br />
n<br />
t<br />
EventQuery<br />
Overlap<br />
EventQuery<br />
cudaDevice<br />
Synchr<strong>on</strong>ize()<br />
CPU IDLE<br />
EventQuery<br />
EventQuery<br />
CPU ONLY<br />
GPU idle<br />
GPU idle<br />
GPU<br />
e<br />
v<br />
e<br />
n<br />
t<br />
Kernel Executi<strong>on</strong><br />
GPU ONLY<br />
Blame<br />
e<br />
v<br />
e<br />
n<br />
t<br />
stream<br />
60%<br />
Time<br />
40%<br />
Milind Chabbi<br />
<str<strong>on</strong>g>HPCToolkit</str<strong>on</strong>g>: A Tool for <str<strong>on</strong>g>Performance</str<strong>on</strong>g> <str<strong>on</strong>g>Analysis</str<strong>on</strong>g> <strong>on</strong> <strong>Heterogeneous</strong> Supercomputers
CPU-GPU Blame Shifting in Acti<strong>on</strong> in<br />
<str<strong>on</strong>g>HPCToolkit</str<strong>on</strong>g><br />
CPU<br />
hread<br />
e<br />
v<br />
e<br />
n<br />
t<br />
K<br />
e<br />
r<br />
n<br />
e<br />
l<br />
e<br />
v<br />
e<br />
n<br />
t<br />
EventQuery<br />
Overlap<br />
EventQuery<br />
cudaDevice<br />
Synchr<strong>on</strong>ize()<br />
CPU IDLE<br />
EventQuery<br />
EventQuery<br />
CPU ONLY<br />
GPU idle<br />
GPU idle<br />
GPU<br />
e<br />
v<br />
e<br />
n<br />
t<br />
Kernel Executi<strong>on</strong><br />
GPU ONLY<br />
Blame<br />
e<br />
v<br />
e<br />
n<br />
t<br />
GPU IDLE<br />
stream<br />
60%<br />
Time<br />
40%<br />
Milind Chabbi<br />
<str<strong>on</strong>g>HPCToolkit</str<strong>on</strong>g>: A Tool for <str<strong>on</strong>g>Performance</str<strong>on</strong>g> <str<strong>on</strong>g>Analysis</str<strong>on</strong>g> <strong>on</strong> <strong>Heterogeneous</strong> Supercomputers
CPU-GPU Blame Shifting in Acti<strong>on</strong> in<br />
<str<strong>on</strong>g>HPCToolkit</str<strong>on</strong>g><br />
CPU<br />
hread<br />
e<br />
v<br />
e<br />
n<br />
t<br />
K<br />
e<br />
r<br />
n<br />
e<br />
l<br />
e<br />
v<br />
e<br />
n<br />
t<br />
EventQuery<br />
Overlap<br />
EventQuery<br />
cudaDevice<br />
Synchr<strong>on</strong>ize()<br />
CPU IDLE<br />
EventQuery<br />
EventQuery<br />
CPU ONLY<br />
GPU idle<br />
Blame<br />
GPU idle<br />
GPU<br />
e<br />
v<br />
e<br />
n<br />
t<br />
Kernel Executi<strong>on</strong><br />
GPU ONLY<br />
Blame<br />
e<br />
v<br />
e<br />
n<br />
t<br />
GPU IDLE<br />
stream<br />
60%<br />
Time<br />
40%<br />
Milind Chabbi<br />
<str<strong>on</strong>g>HPCToolkit</str<strong>on</strong>g>: A Tool for <str<strong>on</strong>g>Performance</str<strong>on</strong>g> <str<strong>on</strong>g>Analysis</str<strong>on</strong>g> <strong>on</strong> <strong>Heterogeneous</strong> Supercomputers
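The blame-shifting timeline can be turned into a small worked example. The sketch below is not HPCToolkit's implementation; it is a minimal model that assumes time is divided into equal sampling ticks, that each tick records what the CPU thread and the GPU stream were doing, and that idle time on one device is charged to whatever was running on the other. The names `shift_blame`, `cpu_work`, and `kernel_A` are hypothetical.

```python
from collections import Counter

IDLE = None  # marker for "this device was waiting during the tick"


def shift_blame(cpu_ticks, gpu_ticks):
    """Charge each idle tick on one device to the code active on the other.

    cpu_ticks[i] / gpu_ticks[i] name the code running on the CPU thread /
    GPU stream during tick i, or IDLE if the device was waiting.
    Returns (blame_on_gpu_code, blame_on_cpu_code) as Counters of ticks.
    """
    if len(cpu_ticks) != len(gpu_ticks):
        raise ValueError("both timelines must cover the same ticks")
    blame_on_gpu_code = Counter()  # CPU idle while a kernel runs
    blame_on_cpu_code = Counter()  # GPU idle while CPU code runs
    for cpu, gpu in zip(cpu_ticks, gpu_ticks):
        if cpu is IDLE and gpu is not IDLE:
            blame_on_gpu_code[gpu] += 1
        elif gpu is IDLE and cpu is not IDLE:
            blame_on_cpu_code[cpu] += 1
        # Both busy (overlap) or both idle: no blame is shifted.
    return blame_on_gpu_code, blame_on_cpu_code


# Hypothetical 10-tick run: overlap, then the CPU waits in a sync call,
# then the CPU works alone while the GPU has nothing to do.
cpu = ["cpu_work"] * 3 + [IDLE] * 3 + ["cpu_work"] * 4
gpu = ["kernel_A"] * 6 + [IDLE] * 4
on_gpu, on_cpu = shift_blame(cpu, gpu)
print(dict(on_gpu))  # {'kernel_A': 3}
print(dict(on_cpu))  # {'cpu_work': 4}
```

In this model the three ticks of CPU waiting are attributed to `kernel_A`, and the four ticks of GPU idleness are attributed to `cpu_work`, which mirrors the two Blame labels in the figure.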
Advanced Features<br />
• HPCToolkit’s CPU-GPU blame shifting supports<br />
✦ Multiple GPU streams<br />
✦ Multiple threads in a process<br />
✦ Multiple MPI ranks sharing the same GPU<br />
✦ Any combination of the above<br />
• GPU hardware counters for kernel-level hotspot analysis<br />
• Performance modeling when multiplexing GPUs<br />
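With several streams, more than one kernel can be active while a CPU thread waits. The slides do not say how blame is divided in that case; the sketch below assumes an even split among the kernels active in each tick, purely for illustration. The function name `split_idle_blame` and the `(stream, kernel)` labels are hypothetical.

```python
from collections import defaultdict


def split_idle_blame(idle_ticks, active_kernels_per_tick):
    """Divide each CPU-idle tick evenly among the kernels active in it.

    active_kernels_per_tick[i] is the set of (stream, kernel) pairs running
    during tick i; idle_ticks lists the ticks in which the CPU was waiting.
    Ticks with no active kernel assign no blame.
    """
    blame = defaultdict(float)
    for tick in idle_ticks:
        active = active_kernels_per_tick[tick]
        for kernel in active:
            blame[kernel] += 1.0 / len(active)
    return dict(blame)


# Hypothetical run: two streams, CPU idle during ticks 1 and 2.
ticks = [
    {("s0", "k1")},
    {("s0", "k1"), ("s1", "k2")},
    {("s1", "k2")},
]
print(split_idle_blame([1, 2], ticks))
```

Here `("s0", "k1")` receives half a tick and `("s1", "k2")` receives one and a half ticks.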
Insights via Blame Shifting in<br />
LULESH CUDA<br />
• LULESH: Shock Hydrodynamics code<br />
✦ One of five challenge problems in the DARPA UHPC program<br />
✦ Compute-intensive simulation of complex multi-material motion<br />
• CUDA version available from LLNL<br />
Insights via Blame Shifting in<br />
LULESH CUDA<br />
[Figure: CPU and GPU timelines against time. On the CPU: a CPU IDLE interval and calls to cudaMalloc() and cudaFree(). On the GPU: two GPU IDLE intervals.]<br />
LULESH CUDA Memory Allocation<br />
[Figure: "Memory request with time". Vertical axis: total bytes allocated, 0 to 9.E+07. Horizontal axis: memory requests over time (ticks 1 to 321), spanning time steps 1 through 4.]<br />
Hoisting repeated allocation/free improved GPU utilization from<br />
65% to 95% and reduced execution time by 30%<br />
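Hoisting allocation and free out of the time step loop can be illustrated without a GPU. The sketch below is a host-side stand-in, not LULESH code: `FakeDevice`, `run_naive`, and `run_hoisted` are hypothetical names, and the class only counts calls at the points where a CUDA program would call `cudaMalloc()` and `cudaFree()`.

```python
class FakeDevice:
    """Counts allocation calls; stands in for a GPU memory allocator."""

    def __init__(self):
        self.malloc_calls = 0
        self.free_calls = 0

    def malloc(self, nbytes):
        self.malloc_calls += 1
        return bytearray(nbytes)  # placeholder for a device buffer

    def free(self, buf):
        self.free_calls += 1


def run_naive(dev, steps, nbytes):
    # Allocate and free a scratch buffer inside every time step.
    for _ in range(steps):
        scratch = dev.malloc(nbytes)
        scratch[0] = 1  # placeholder for launching kernels that use it
        dev.free(scratch)


def run_hoisted(dev, steps, nbytes):
    # Allocate once before the loop, reuse the buffer, free once after.
    scratch = dev.malloc(nbytes)
    for _ in range(steps):
        scratch[0] = 1  # placeholder for launching kernels that use it
    dev.free(scratch)


naive, hoisted = FakeDevice(), FakeDevice()
run_naive(naive, steps=4, nbytes=1024)
run_hoisted(hoisted, steps=4, nbytes=1024)
print(naive.malloc_calls, hoisted.malloc_calls)  # 4 1
```

The counts show only how many allocator calls each version makes; they say nothing about time. The utilization and execution time figures quoted on the slide come from the measured LULESH run, not from this model.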
Understanding Temporal Behavior<br />
• Profiling compresses out the temporal dimension<br />
✦ Temporal behaviors, e.g., serialization, are invisible in profiles<br />
• What can we do? Trace call path samples<br />
✦ On each sample, trace the call path for each CPU and GPU<br />
✦ Organize samples along a timeline<br />
✦ Assign each procedure a color; view a depth slice of an execution<br />
[Figure axis label: Compute resources]<br />
Orders of magnitude smaller trace sizes<br />
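The three steps listed on this slide can be sketched in a few lines. This is an illustrative model rather than HPCToolkit's trace format: it assumes each sample is a `(time, resource, call_path)` tuple with a non-empty call path, and the names `depth_slice` and `assign_colors` are hypothetical.

```python
from itertools import count


def depth_slice(samples, depth):
    """Return {resource: [(time, procedure), ...]} at one call path depth.

    Each sample is (time, resource, call_path), where call_path lists
    procedure names from the outermost frame inward. A path shorter than
    the requested depth contributes its innermost frame.
    """
    timeline = {}
    for time, resource, path in sorted(samples, key=lambda s: s[0]):
        frame = path[min(depth, len(path) - 1)]
        timeline.setdefault(resource, []).append((time, frame))
    return timeline


def assign_colors(timeline):
    """Give each distinct procedure a small integer color id."""
    ids = count()
    colors = {}
    for row in timeline.values():
        for _, proc in row:
            if proc not in colors:
                colors[proc] = next(ids)
    return colors


# Hypothetical samples from one CPU thread and one GPU stream.
samples = [
    (2, "gpu0", ["stream", "kernel_A"]),
    (1, "cpu0", ["main", "solve", "cudaDeviceSynchronize"]),
    (2, "cpu0", ["main", "io"]),
]
view = depth_slice(samples, depth=1)
print(view)
print(assign_colors(view))
```

Changing `depth` selects a shallower or deeper frame of every call path, which is the depth slice the slide refers to.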
HPCTraceviewer to Identify<br />
Performance Issue on<br />
Keeneland Cluster<br />
DEMO<br />
Conclusions<br />
HPCToolkit is a performance analysis tool for<br />
heterogeneous systems<br />
✦ Sampling-centric approach<br />
★ Low overhead and compact measurement data<br />
✦ CPU-GPU blame shifting<br />
★ Pinpoints and quantifies code fragments (CPU and GPU)<br />
worth tuning<br />
✦ Open source, BSD license<br />
hpctoolkit.org<br />