
HPCToolkit:
A Tool for Performance Analysis
on Heterogeneous Supercomputers

Milind Chabbi, Karthik Murthy,
Mike Fagan, and John Mellor-Crummey
Rice University

GTC 2013
San Jose
March 21, 2013


Architectural Shift in Supercomputing

18K 16-core CPUs; 18K Tesla K20 GPUs

Milind Chabbi
HPCToolkit: A Tool for Performance Analysis on Heterogeneous Supercomputers


Architectural Shift in Supercomputing

Keeneland (KFS) at University of Tennessee
★ 264 nodes, each with two 8-core Xeon E5 CPUs + 3 Nvidia Tesla M2090 GPUs


Challenge: Hybrid Applications

Migrating scientific applications to heterogeneous systems
✦ Molecular dynamics
  ★ LAMMPS
  ★ NAMD
✦ Hydrodynamics
  ★ LULESH
✦ Lattice field theory
  ★ Chroma

These codes have a highly-tuned CPU part and an emerging GPU part.
Performance analysis tools play a vital role in tuning these applications.


Project Goals

• Provide performance analysis tools for emerging heterogeneous supercomputers
  ✦ GPU kernels alone → whole application performance analysis
• Provide performance improvement “insights”
  ✦ Opportunities for tuning, rather than resource consumption
• Desired properties of the tool
  ✦ Introduce minimal execution perturbation
  ✦ Collect compact measurement data at scale
  ✦ Provide language / programming model independence


Accelerated LAMMPS

Molecular dynamics code LAMMPS has two accelerated versions:

LAMMPS-CUDA
• Offloads many timesteps of force calculation to GPUs
• 3 GPUs busy; 3 CPU cores idle and 13 CPU cores unused

LAMMPS-GPU
• Atom data moves in and out of GPU each timestep
• CPU and GPU computations are not overlapped
• Emerging support for CPU-GPU load balancing


LAMMPS-CUDA on Keeneland

[Figure: execution timeline (Time axis) with one CPU row, labeled "Mostly Idle", and two GPU STREAM rows.]


Idleness Wastes Compute Power

• Offloading entire computation to GPUs wastes CPU
• Performing entire computation on CPUs wastes GPU

Better approach:
• Overlap CPU and GPU computation
  ✦ Divide principal computation between CPU and GPU
  ✦ Use CPU to prepare next work while GPU is busy
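A rough timing model shows why overlapping CPU preparation with GPU execution pays off. This is a sketch under stated assumptions, not anything from the talk: the function names and the per-step costs are hypothetical, and every step is assumed to cost the same.

```python
def serial_time(cpu_prep, gpu_work, steps):
    """CPU prepares a step, then waits while the GPU runs it; nothing overlaps."""
    return steps * (cpu_prep + gpu_work)


def overlapped_time(cpu_prep, gpu_work, steps):
    """CPU prepares step i+1 while the GPU runs step i (two-stage pipeline)."""
    if steps == 0:
        return 0.0
    # The first preparation cannot be hidden. After that, each step costs
    # whichever stage is slower, and the last GPU run drains the pipeline.
    return cpu_prep + (steps - 1) * max(cpu_prep, gpu_work) + gpu_work


if __name__ == "__main__":
    # Hypothetical costs in milliseconds per step.
    cpu_prep, gpu_work, steps = 2.0, 3.0, 100
    print("serial:    ", serial_time(cpu_prep, gpu_work, steps), "ms")
    print("overlapped:", overlapped_time(cpu_prep, gpu_work, steps), "ms")
```

In this model the overlapped schedule is bounded by the slower of the two stages, so the faster resource still sits idle part of the time; that remaining idleness is what the rest of the talk sets out to quantify.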


Focus on Resource (under) Utilization

• To achieve high performance on heterogeneous systems, applications should exploit all compute resources
• Problem: “hotspot analysis” of today’s performance analysis tools is insufficient for tuning
  ✦ Focuses on “most consumed” resources
  ✦ Identifies only symptoms of performance problems
  ✦ Does not identify causes of problems


Our Methodology: Pinpoint and Quantify Idleness

• Quantify resource idleness
• Attribute idleness to its causes
  ✦ If GPU is idle → blame CPU code for not offloading (enough) work
  ✦ If CPU is waiting for results from GPU → blame GPU kernel(s) involved

a.k.a. “CPU-GPU Blame Shifting”
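The two attribution rules can be sketched on a pre-recorded timeline. This is an illustrative sketch only: the timeline format, the function name `shift_blame`, and the region and kernel names are hypothetical. The tool described in this talk attributes blame on the fly rather than from a recorded trace.

```python
from collections import defaultdict


def shift_blame(timeline):
    """Apply the two blame-shifting rules to a list of time slices.

    Each slice is (duration, cpu_activity, gpu_kernel):
      cpu_activity: name of the CPU code region, or None while the CPU
                    waits for results from the GPU
      gpu_kernel:   name of the running kernel, or None while the GPU is idle
    """
    gpu_idle_blame = defaultdict(float)  # CPU code blamed for GPU idleness
    cpu_idle_blame = defaultdict(float)  # GPU kernels blamed for CPU waiting
    for duration, cpu_activity, gpu_kernel in timeline:
        if gpu_kernel is None and cpu_activity is not None:
            # GPU idle: blame the CPU code that is not offloading work.
            gpu_idle_blame[cpu_activity] += duration
        elif cpu_activity is None and gpu_kernel is not None:
            # CPU waiting: blame the kernel it is waiting for.
            cpu_idle_blame[gpu_kernel] += duration
        # Both busy is overlap; both idle is not attributed in this sketch.
    return dict(gpu_idle_blame), dict(cpu_idle_blame)


if __name__ == "__main__":
    timeline = [
        (4, "setup", None),           # GPU idle while the CPU sets up
        (3, "pack_next", "kernelA"),  # overlap, no blame
        (5, None, "kernelA"),         # CPU waits for kernelA
        (2, "unpack", None),          # GPU idle again
    ]
    gpu_idle, cpu_idle = shift_blame(timeline)
    print("GPU idleness blamed on CPU code:", gpu_idle)
    print("CPU idleness blamed on kernels: ", cpu_idle)
```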


Quantifying Tuning Opportunities with CPU-GPU Blame Shifting - 1

[Figure: timeline (Time axis) with a CPU row (WORK, SYNC, WORK, SYNC) above a GPU row (Kernel A, Kernel B). CPU idleness is marked as 5% for Kernel A and 40% for Kernel B. The figure contrasts "Hotspot analysis" with "Blame shifting".]

• 5% expected gain by tuning Kernel A
• 40% expected gain by tuning Kernel B

Top GPU-kernel may not be the best candidate for tuning
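The contrast between the two rankings can be written out as a small calculation. In this sketch the CPU-idle shares (5% and 40%) come from the slide; the GPU-time shares are hypothetical values, chosen only so that Kernel A is the top GPU kernel.

```python
# Shares of total run time. "cpu_idle_blamed" uses the slide's 5% and 40%;
# "gpu_time" values are hypothetical.
KERNELS = {
    "Kernel A": {"gpu_time": 0.50, "cpu_idle_blamed": 0.05},
    "Kernel B": {"gpu_time": 0.45, "cpu_idle_blamed": 0.40},
}


def rank(kernels, metric):
    """Return kernel names sorted by the given metric, largest first."""
    return sorted(kernels, key=lambda name: kernels[name][metric], reverse=True)


if __name__ == "__main__":
    # Hotspot analysis ranks by time consumed on the GPU.
    print("hotspot order:       ", rank(KERNELS, "gpu_time"))
    # Blame shifting ranks by CPU idleness attributed to each kernel,
    # which approximates the expected gain from tuning it.
    print("blame-shifting order:", rank(KERNELS, "cpu_idle_blamed"))
```

With these numbers the hotspot view puts Kernel A first, while the blame-shifting view puts Kernel B first.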


Hotspot vs. Blame Shifting on LULESH

Rank  Hotspot                                      Blame Shifting
1     CalcFBHourglassForceForElems                 CalcFBHourglassForceForElems
2     CalcKinematicsForElems                       CalcHourglassControlForElems
3     CalcMonotonicQGradientsForElems              IntegrateStressForElems
4     CalcHourglassControlForElems                 AddNodeForcesFromElems2
5     IntegrateStressForElems                      AddNodeForcesFromElems
6     AddNodeForcesFromElems2                      CalcKinematicsForElems
7     AddNodeForcesFromElems                       EvalEOSForElemsPart2
8     CalcPressureForElems                         CalcSoundSpeedForElems
9     CalcMonotonicQRegionForElems                 CalcLagrangeElementsPart2
10    EvalEOSForElemsPart1                         InitStressTermsForElems
11    CalcEnergyForElemsPart2                      CalcCourantConstraintForElems
12    CalcEnergyForElemsPart3                      CalcHydroConstraintForElems
13    CalcPositionForNodes                         UpdateVolumesForElems
14    CalcVelocityForNodes                         CalcMonotonicQGradientsForElems
15    CalcEnergyForElemsPart4                      EvalEOSForElemsPart1
16    CalcAccelerationForNodes                     CalcMonotonicQRegionForElems
17    CalcLagrangeElementsPart2                    CalcPressureForElems
18    EvalEOSForElemsPart2                         CalcEnergyForElemsPart2
19    CalcSoundSpeedForElems                       CalcEnergyForElemsPart3
20    CalcEnergyForElemsPart1                      CalcEnergyForElemsPart4
21    InitStressTermsForElems                      ApplyMaterialPropertiesForElemsPart1
22    CalcCourantConstraintForElems                CalcEnergyForElemsPart1
23    CalcHydroConstraintForElems                  CalcPositionForNodes
24    ApplyMaterialPropertiesForElemsPart1         CalcVelocityForNodes
25    UpdateVolumesForElems                        CalcAccelerationForNodes
26    ApplyAccelerationBoundaryConditionsForNodes  ApplyAccelerationBoundaryConditionsForNodes




Quantifying Tuning Opportunities with CPU-GPU Blame Shifting - 2

Vice versa is also true

[Figure: timeline (Time axis) with a CPU row (CPU PART A, SYNC, CPU PART B, SYNC) above a GPU row (KernelA, GPU IDLE, KernelB).]

Tuning “CPU PART A” reduces critical path


HPCToolkit

Performance tool for large parallel systems

• Supports multilingual, fully-optimized, statically or dynamically linked applications (no source modification)
• Measures performance using asynchronous sampling of timers and hardware performance counters
  ✦ Low overhead (under 5%) for both profiling and tracing
• Attributes performance to full call paths
  ✦ Pthread, OpenMP, MPI, and any combination
• Decentralized → Scales to 1000s of nodes
• Provides GUIs for code- and time-centric analysis
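Attributing samples to full call paths is commonly done with a calling context tree, in which each node is one frame reached through one specific chain of callers. The sketch below illustrates the idea only; the class and function names are hypothetical and do not reflect HPCToolkit's actual data structures.

```python
class CCTNode:
    """One frame in one calling context."""

    def __init__(self, frame):
        self.frame = frame
        self.samples = 0    # samples taken while this frame was innermost
        self.children = {}  # callee frame name -> CCTNode

    def child(self, frame):
        """Return the child node for this callee, creating it if needed."""
        if frame not in self.children:
            self.children[frame] = CCTNode(frame)
        return self.children[frame]

    def inclusive(self):
        """Samples in this frame plus everything it called."""
        return self.samples + sum(c.inclusive() for c in self.children.values())


def attribute(root, call_path):
    """Charge one sample to the node at the end of call_path (outermost first)."""
    node = root
    for frame in call_path:
        node = node.child(frame)
    node.samples += 1


if __name__ == "__main__":
    root = CCTNode("<root>")
    # Hypothetical call paths, as a sampling handler might unwind them.
    for path in (["main", "solve", "force"],
                 ["main", "solve", "force"],
                 ["main", "solve"],
                 ["main", "io"]):
        attribute(root, path)
    solve = root.child("main").child("solve")
    print("solve: exclusive", solve.samples, "inclusive", solve.inclusive())
```

Because the path from the root identifies the context, the same function called from two places gets two nodes, so its cost is not merged across callers.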


Enhancing HPCToolkit for Heterogeneous Supercomputers

• Problem: GPUs (natively) don’t support sampling
• Approach: Instrument GPU runtime APIs to monitor asynchronous activities via cudaEvents; use CPU events as sampling points to interrogate GPU
• CPU-GPU blame shifting
  ✦ Attributes blame to causes rather than symptoms on the fly
  ✦ Delivers deep insight without recording activity traces
• Support for full calling context on GPUs to distinguish same kernels launched from different calling contexts
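The sampling-driven approach can be imitated in a small simulation. In a real tool, the question "is this kernel still running?" would be answered at each CPU sample by querying events recorded around the kernel. Here the kernel start and end times are simply given, so no CUDA calls are made. The function name, the input format, and the equal split of blame among concurrently active kernels are assumptions of this sketch.

```python
def sample_blame(kernels, cpu_waits, end_time, period):
    """Blame kernels for CPU waiting, using periodic CPU samples.

    kernels:   list of (name, start, end), the span that a pair of events
               recorded around each kernel would delimit
    cpu_waits: list of (start, end) spans in which the CPU is blocked in a
               synchronization call
    Returns {kernel name: fraction of all samples blamed on that kernel}.
    """
    n_samples = int(end_time // period)
    blame = {}
    for i in range(n_samples):
        t = (i + 0.5) * period  # sample in the middle of each period
        cpu_waiting = any(start <= t < end for start, end in cpu_waits)
        if not cpu_waiting:
            continue  # CPU is doing useful work: nothing to blame
        # Stand-in for interrogating the GPU: which kernels are active now?
        active = [name for name, start, end in kernels if start <= t < end]
        for name in active:
            blame[name] = blame.get(name, 0.0) + 1.0 / len(active)
    return {name: count / n_samples for name, count in blame.items()}


if __name__ == "__main__":
    # Hypothetical run: one kernel spans the whole interval; the CPU overlaps
    # useful work for the first 60 time units, then blocks for the last 40.
    result = sample_blame(
        kernels=[("kernel", 0.0, 100.0)],
        cpu_waits=[(60.0, 100.0)],
        end_time=100.0,
        period=1.0,
    )
    print(result)
```

Only per-kernel counters are kept, so blame accumulates on the fly and no activity trace needs to be stored.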


CPU-GPU Blame Shifting in Action in HPCToolkit

[Figure, built up over several slides: timeline (Time axis) with a CPU thread row and a GPU stream row. Labels on the CPU thread: event, Kernel, event, Overlap, cudaDeviceSynchronize(), CPU IDLE, with EventQuery marks along the row. Labels on the GPU stream: event, Kernel Execution, GPU ONLY. The time axis is annotated 60% and 40%.]

EventQuery<br />

EventQuery<br />

GPU<br />

e<br />

v<br />

e<br />

n<br />

t<br />

Kernel Executi<strong>on</strong><br />

GPU ONLY<br />

Blame<br />

stream<br />

60%<br />

Time<br />

40%<br />

Milind Chabbi<br />

<str<strong>on</strong>g>HPCToolkit</str<strong>on</strong>g>: A Tool for <str<strong>on</strong>g>Performance</str<strong>on</strong>g> <str<strong>on</strong>g>Analysis</str<strong>on</strong>g> <strong>on</strong> <strong>Heterogeneous</strong> Supercomputers


CPU-GPU Blame Shifting in Acti<strong>on</strong> in<br />

<str<strong>on</strong>g>HPCToolkit</str<strong>on</strong>g><br />

CPU<br />

hread<br />

e<br />

v<br />

e<br />

n<br />

t<br />

K<br />

e<br />

r<br />

n<br />

e<br />

l<br />

e<br />

v<br />

e<br />

n<br />

t<br />

EventQuery<br />

Overlap<br />

EventQuery<br />

cudaDevice<br />

Synchr<strong>on</strong>ize()<br />

CPU IDLE<br />

EventQuery<br />

EventQuery<br />

GPU<br />

e<br />

v<br />

e<br />

n<br />

t<br />

Kernel Executi<strong>on</strong><br />

GPU ONLY<br />

Blame<br />

e<br />

v<br />

e<br />

n<br />

t<br />

stream<br />

60%<br />

Time<br />

40%<br />

Milind Chabbi<br />

<str<strong>on</strong>g>HPCToolkit</str<strong>on</strong>g>: A Tool for <str<strong>on</strong>g>Performance</str<strong>on</strong>g> <str<strong>on</strong>g>Analysis</str<strong>on</strong>g> <strong>on</strong> <strong>Heterogeneous</strong> Supercomputers


CPU-GPU Blame Shifting in Acti<strong>on</strong> in<br />

<str<strong>on</strong>g>HPCToolkit</str<strong>on</strong>g><br />

CPU<br />

hread<br />

e<br />

v<br />

e<br />

n<br />

t<br />

K<br />

e<br />

r<br />

n<br />

e<br />

l<br />

e<br />

v<br />

e<br />

n<br />

t<br />

EventQuery<br />

Overlap<br />

EventQuery<br />

cudaDevice<br />

Synchr<strong>on</strong>ize()<br />

CPU IDLE<br />

EventQuery<br />

EventQuery<br />

CPU ONLY<br />

GPU idle<br />

GPU idle<br />

GPU<br />

e<br />

v<br />

e<br />

n<br />

t<br />

Kernel Executi<strong>on</strong><br />

GPU ONLY<br />

Blame<br />

e<br />

v<br />

e<br />

n<br />

t<br />

stream<br />

60%<br />

Time<br />

40%<br />

Milind Chabbi<br />

<str<strong>on</strong>g>HPCToolkit</str<strong>on</strong>g>: A Tool for <str<strong>on</strong>g>Performance</str<strong>on</strong>g> <str<strong>on</strong>g>Analysis</str<strong>on</strong>g> <strong>on</strong> <strong>Heterogeneous</strong> Supercomputers


CPU-GPU Blame Shifting in Acti<strong>on</strong> in<br />

<str<strong>on</strong>g>HPCToolkit</str<strong>on</strong>g><br />

CPU<br />

hread<br />

e<br />

v<br />

e<br />

n<br />

t<br />

K<br />

e<br />

r<br />

n<br />

e<br />

l<br />

e<br />

v<br />

e<br />

n<br />

t<br />

EventQuery<br />

Overlap<br />

EventQuery<br />

cudaDevice<br />

Synchr<strong>on</strong>ize()<br />

CPU IDLE<br />

EventQuery<br />

EventQuery<br />

CPU ONLY<br />

GPU idle<br />

GPU idle<br />

GPU<br />

e<br />

v<br />

e<br />

n<br />

t<br />

Kernel Executi<strong>on</strong><br />

GPU ONLY<br />

Blame<br />

e<br />

v<br />

e<br />

n<br />

t<br />

GPU IDLE<br />

stream<br />

60%<br />

Time<br />

40%<br />

Milind Chabbi<br />

<str<strong>on</strong>g>HPCToolkit</str<strong>on</strong>g>: A Tool for <str<strong>on</strong>g>Performance</str<strong>on</strong>g> <str<strong>on</strong>g>Analysis</str<strong>on</strong>g> <strong>on</strong> <strong>Heterogeneous</strong> Supercomputers


CPU-GPU Blame Shifting in Action in HPCToolkit

[Figure: timeline of a CPU thread and a GPU stream; horizontal axis: Time. Labels on the CPU thread: Kernel, event, EventQuery (repeated), Overlap, cudaDeviceSynchronize(), CPU IDLE, CPU ONLY, GPU idle (twice), Blame. Labels on the GPU stream: event, Kernel Execution, GPU ONLY, Blame, GPU IDLE. Annotated shares: 60% and 40%.]
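The attribution pictured in the timeline can be sketched in a few lines. This is a minimal sketch, not HPCToolkit's implementation: the `Sample` record, its field names, and the rule of splitting one unit of CPU-idle blame evenly among the active kernels are assumptions made for illustration. It assumes that a sample taken while the GPU is idle charges the CPU code that is running, and that a sample taken while the CPU waits charges the kernels that are executing.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Sample:
    """One CPU-side sample (illustrative fields, not HPCToolkit's data format)."""
    cpu_context: str            # CPU call path active when the sample fired
    cpu_waiting: bool           # True if the CPU thread was blocked in a sync call
    active_kernels: List[str]   # kernels running on the GPU at that moment


def shift_blame(samples: List[Sample]) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (GPU-idle blame per CPU context, CPU-idle blame per kernel)."""
    gpu_idle_blame: Dict[str, float] = defaultdict(float)
    cpu_idle_blame: Dict[str, float] = defaultdict(float)
    for s in samples:
        if not s.active_kernels and not s.cpu_waiting:
            # GPU idle while the CPU works: charge the running CPU code.
            gpu_idle_blame[s.cpu_context] += 1.0
        elif s.active_kernels and s.cpu_waiting:
            # CPU idle while kernels run: split one unit among those kernels.
            share = 1.0 / len(s.active_kernels)
            for kernel in s.active_kernels:
                cpu_idle_blame[kernel] += share
        # Otherwise both sides are busy (overlap) or both are idle: no blame.
    return dict(gpu_idle_blame), dict(cpu_idle_blame)


# Hypothetical samples: GPU idle, overlap, then two samples with the CPU waiting.
samples = [
    Sample("main>setup", False, []),
    Sample("main>step", False, ["kernelA"]),
    Sample("main>cudaDeviceSynchronize", True, ["kernelA"]),
    Sample("main>cudaDeviceSynchronize", True, ["kernelA", "kernelB"]),
]
gpu_idle, cpu_idle = shift_blame(samples)
# gpu_idle == {"main>setup": 1.0}
# cpu_idle == {"kernelA": 1.5, "kernelB": 0.5}
```

Under this rule, a CPU context that accumulates GPU-idle blame is a candidate for launching GPU work earlier, and a kernel that accumulates CPU-idle blame is a candidate for tuning or for overlapping with CPU work.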



Advanced Features

• HPCToolkit’s CPU-GPU blame shifting supports
✦ Multiple GPU streams
✦ Multiple threads in a process
✦ Multiple MPI ranks sharing same GPU
✦ Any combination of above
• GPU hardware counters for kernel-level hotspot analysis
• Performance modeling when multiplexing GPUs



Insights via Blame Shifting in LULESH CUDA

• LULESH: Shock Hydrodynamics code
✦ One of five challenge problems in the DARPA UHPC program
✦ Compute-intensive simulation of complex multi-material motion
• CUDA version available from LLNL



Insights via Blame Shifting in LULESH CUDA

[Figure: timeline of CPU and GPU activity; horizontal axis: Time. Labels: CPU, CPU IDLE, cudaMalloc(), cudaFree(), GPU IDLE (twice), GPU.]

LULESH CUDA Memory Allocation

[Figure: total bytes allocated (vertical axis, 0.E+00 to 9.E+07) per memory request in time order (horizontal axis "Memory request with time", requests 1 to 321), with time steps 1 through 4 marked.]

Hoisting repeated allocation/free improved GPU utilization from 65% to 95% and reduced execution time by 30%.
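The hoisting transformation can be sketched as follows. This is a Python stand-in under stated assumptions, not the LULESH change itself, which the slides do not show: `CountingAllocator` is a hypothetical substitute for the device allocator (in the CUDA code the corresponding calls would be `cudaMalloc()` and `cudaFree()`), and the per-step work is a placeholder.

```python
class CountingAllocator:
    """Hypothetical stand-in for a device allocator; it only counts calls."""

    def __init__(self) -> None:
        self.allocs = 0
        self.frees = 0

    def alloc(self, nbytes: int) -> bytearray:
        self.allocs += 1
        return bytearray(nbytes)

    def free(self, buf: bytearray) -> None:
        self.frees += 1


def run_naive(dev: CountingAllocator, steps: int, nbytes: int) -> None:
    """Allocate and free a scratch buffer inside every time step."""
    for _ in range(steps):
        scratch = dev.alloc(nbytes)
        scratch[0] = 1            # placeholder for the per-step work
        dev.free(scratch)


def run_hoisted(dev: CountingAllocator, steps: int, nbytes: int) -> None:
    """Allocate once before the loop, reuse the buffer, free once after it."""
    scratch = dev.alloc(nbytes)
    for _ in range(steps):
        scratch[0] = 1            # same placeholder work, buffer reused
    dev.free(scratch)
```

With four time steps, `run_naive` issues four allocations and four frees, while `run_hoisted` issues one of each. The sketch assumes the buffer size does not change between time steps; if it did, the hoisted version would need to allocate for the largest request.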



Understanding Temporal Behavior

• Profiling compresses out the temporal dimension
✦ Temporal behavior, e.g., serialization, is invisible in profiles
• What can we do? Trace call path samples
✦ On each sample, trace the call path sample for each CPU and GPU
✦ Organize samples along a timeline
✦ Assign each procedure a color; view a depth slice of an execution

[Figure label: Compute resources. Callout: Orders of magnitude smaller trace sizes.]
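The tracing steps listed above (order samples along a timeline, pick one call-path depth, color by procedure) can be sketched as follows. This is a minimal sketch under assumptions, not hpctraceviewer's code: the `(timestamp, call path)` sample layout, the function names, and the palette are all hypothetical.

```python
from typing import Dict, List, Tuple

# (timestamp, call path from outermost to innermost frame); hypothetical layout.
TraceSample = Tuple[float, List[str]]


def depth_slice(trace: List[TraceSample], depth: int) -> List[Tuple[float, str]]:
    """Order samples by time and keep the frame at `depth`.

    If a call path is shallower than `depth`, its deepest frame is used.
    Call paths are assumed to be non-empty.
    """
    sliced = []
    for timestamp, path in sorted(trace, key=lambda s: s[0]):
        frame = path[min(depth, len(path) - 1)]
        sliced.append((timestamp, frame))
    return sliced


def assign_colors(procedures: List[str], palette: List[str]) -> Dict[str, str]:
    """Give each distinct procedure a stable color, cycling through the palette."""
    colors: Dict[str, str] = {}
    for proc in procedures:
        if proc not in colors:
            colors[proc] = palette[len(colors) % len(palette)]
    return colors


# One timeline (a CPU thread or a GPU stream) with two out-of-order samples.
trace = [
    (2.0, ["main", "solve", "kernelA"]),
    (1.0, ["main", "init"]),
]
row = depth_slice(trace, 2)          # [(1.0, "init"), (2.0, "kernelA")]
colors = assign_colors([proc for _, proc in row], ["red", "blue"])
```

Each CPU thread and GPU stream would get its own row built this way; stacking the rows gives a view in which changing `depth` shows the execution at a coarser or finer level of the call path.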



HPCTraceviewer to Identify Performance Issue on Keeneland Cluster

DEMO



Conclusions

HPCToolkit is a performance analysis tool for heterogeneous systems
✦ Sampling-centric approach
★ Low overhead and compact measurement data
✦ CPU-GPU blame shifting
★ Pinpoints and quantifies code fragments (CPU and GPU) worth tuning
✦ Open source, BSD license

hpctoolkit.org
