Instruction Throughput - GPU Technology Conference

Latency: Analysis

• Suspect latency issues if:
  - Neither memory nor instruction throughput rates are close to HW theoretical rates
  - Poor overlap between mem and math
    - Full-kernel time is significantly larger than max{mem-only, math-only}
• Two possible causes:
  - Insufficient concurrent threads per multiprocessor to hide latency
    - Occupancy too low
    - Too few threads in kernel launch to load the GPU
    - Indicator: elapsed time doesn't change if problem size is increased (and with it the number of blocks/threads)
  - Too few concurrent threadblocks per SM when using __syncthreads()
    - __syncthreads() can prevent overlap between math and mem within the same threadblock

© NVIDIA Corporation 2011
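The two "suspect latency" checks above can be sketched as a small host-side helper. This is only an illustration, not part of the original slides: the struct `KernelTimings`, the functions `overlap_fraction` and `suspect_latency`, and the thresholds `near_peak` and `slack` are all hypothetical names and values chosen for the example. It assumes you have already measured the mem-only, math-only, and full-kernel times, and know what fraction of the theoretical memory and instruction throughput the kernel achieves.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical container for three measured kernel times (milliseconds).
struct KernelTimings {
    double mem_only_ms;   // kernel variant with the math removed
    double math_only_ms;  // kernel variant with the memory accesses removed
    double full_ms;       // unmodified kernel
};

// Fraction of the shorter phase that is hidden behind the longer one.
// 1.0 means perfect overlap (full == max{mem-only, math-only});
// 0.0 means no overlap (full == mem-only + math-only).
double overlap_fraction(const KernelTimings& t) {
    assert(t.mem_only_ms > 0.0 && t.math_only_ms > 0.0 && t.full_ms > 0.0);
    const double longer  = std::max(t.mem_only_ms, t.math_only_ms);
    const double shorter = std::min(t.mem_only_ms, t.math_only_ms);
    const double hidden  = (longer + shorter) - t.full_ms;
    // Clamp because measured times are noisy and can fall outside the ideal range.
    return std::max(0.0, std::min(1.0, hidden / shorter));
}

// Applies both indicators from the list above.
// mem_frac_of_peak / instr_frac_of_peak: achieved throughput divided by the
// hardware theoretical rate (0.0 to 1.0).
// near_peak and slack are assumed thresholds, not values from the slides:
// "close to theoretical" is taken as >= near_peak, and "significantly larger"
// as more than slack times max{mem-only, math-only}.
bool suspect_latency(const KernelTimings& t,
                     double mem_frac_of_peak,
                     double instr_frac_of_peak,
                     double near_peak = 0.7,
                     double slack = 1.2) {
    const bool neither_near_peak =
        mem_frac_of_peak < near_peak && instr_frac_of_peak < near_peak;
    const double longer = std::max(t.mem_only_ms, t.math_only_ms);
    const bool poor_overlap = t.full_ms > slack * longer;
    return neither_near_peak && poor_overlap;
}
```

For example, timings of 10 ms (mem-only), 8 ms (math-only), and 18 ms (full kernel) at 40% and 30% of the theoretical rates would be flagged, since the full-kernel time equals the sum of the two phases. The thresholds would need tuning for a real kernel and GPU.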
Simplified View of Latency and Syncs

[Figure: timeline bars comparing memory-only time, math-only time, and full-kernel time with one large threadblock per SM, for a kernel where most math cannot be executed until all data is loaded by the threadblock.]
