Instruction Throughput - GPU Technology Conference

More documents

Recommendations

Info

Analysis-Driven Optimization • Determine the limits for kernel performance • Memory throughput • Instruction throughput • Latency • Combination of the above • Use appropriate performance metric for each kernel • For example, for a memory bandwidth-bound kernel, Gflops/s don‟t make sense • Address the limiters in the order of importance • Determine how close resource usage is to the HW limits • Analyze for possible inefficiencies • Apply optimizations - Often these will just be obvious from how HW operates © NVIDIA Corporation 2011 2
Presentation Outline • Identifying performance limiters • Analyzing and optimizing : • Memory-bound kernels (Oct 11th, Tim Schroeder) • Instruction (math) bound kernels • Kernels with poor latency hiding • Register spilling (Oct 4th, Paulius Micikevicius) • For each: • Brief background • How to analyze • How to judge whether particular issue is problematic • How to optimize • Some cases studies based on “real-life” application kernels • Most information is for Fermi GPUs © NVIDIA Corporation 2011 3
Page 1: Instruction Limited Kernels CUDA Op
Page 5 and 6: Limited by Bandwidth or Arithmetic?
Page 7 and 8: Optimizations for Instruction Throu
Page 9 and 10: Instruction Throughput: Analysis
Page 11 and 12: Serialization: Profiler Analysis
Page 13 and 14: Serialization: Analysis with Modifi
Page 15 and 16: Case Study: SMEM Bank Conflicts •
Page 17 and 18: Optimizations for Latency © NVIDIA
Page 19 and 20: Simplified View of Latency and Sync
Page 21 and 22: Latency: Optimization • Insuffici

Instruction Throughput - GPU Technology Conference

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?