Instruction Throughput - GPU Technology Conference
Latency: Analysis
• Suspect latency issues if:
  • Neither memory nor instruction throughput rates are close to HW theoretical rates
  • Poor overlap between mem and math
    - Full-kernel time is significantly larger than max{mem-only, math-only}
• Two possible causes:
  • Insufficient concurrent threads per multiprocessor to hide latency
    - Occupancy too low
    - Too few threads in kernel launch to load the GPU
    - Indicator: elapsed time doesn't change if problem size is increased (and with it the number of blocks/threads)
  • Too few concurrent threadblocks per SM when using __syncthreads()
    - __syncthreads() can prevent overlap between math and mem within the same threadblock
© NVIDIA Corporation 2011
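The checks on this slide can be sketched as a few host-side helpers that work on timings you have already measured (for example with the profiler or CUDA events). This is only an illustration: the function names, the 1.3 overlap cutoff, and the 5% tolerance are assumptions chosen for the example and do not come from the slides.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper: ratio of full-kernel time to max{mem-only, math-only}.
// A value near 1.0 means memory and math overlap well; larger values mean
// the full kernel takes noticeably longer than its slower component.
inline double overlap_ratio(double mem_only_ms, double math_only_ms,
                            double full_ms) {
    assert(mem_only_ms > 0.0 && math_only_ms > 0.0 && full_ms > 0.0);
    return full_ms / std::max(mem_only_ms, math_only_ms);
}

// Flags poor overlap. The default threshold of 1.3 is an assumed cutoff for
// "significantly larger"; the slides do not give a number.
inline bool suspect_latency(double mem_only_ms, double math_only_ms,
                            double full_ms, double threshold = 1.3) {
    return overlap_ratio(mem_only_ms, math_only_ms, full_ms) > threshold;
}

// Indicator for too few threads in the launch: elapsed time stays about the
// same when the problem size (and the number of blocks/threads) grows.
// The 5% relative tolerance is an assumption.
inline bool too_few_threads(double time_small_ms, double time_large_ms,
                            double tolerance = 0.05) {
    assert(time_small_ms > 0.0);
    return std::fabs(time_large_ms - time_small_ms) / time_small_ms
           <= tolerance;
}
```

For instance, with a mem-only time of 4 ms, a math-only time of 2 ms, and a full-kernel time of 6 ms, the ratio is 1.5, so the kernel would be flagged under the assumed cutoff.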
Simplified View of Latency and Syncs
[Figure: bars along a time axis comparing memory-only time, math-only time, and full-kernel time with one large threadblock per SM, for a kernel where most math cannot be executed until all data is loaded by the threadblock]
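To make the picture concrete, here is a deliberately idealised toy model, not a description of how the hardware actually schedules work. It assumes the per-SM work is split evenly across k concurrent threadblocks, that each block must finish its loads before its math starts (the sync), and that one block's loads can overlap another block's math. The function name and the two-stage pipeline formula are assumptions made for this sketch.

```cpp
#include <algorithm>
#include <cassert>

// Toy two-stage pipeline model (an assumption, not measured behaviour):
// each of the k blocks loads for m = mem_only / k, then computes for
// c = math_only / k. Loads and math of different blocks may overlap, so
// the total is one load, one compute, and (k - 1) steps of the slower stage.
inline double modeled_full_time(double mem_only, double math_only,
                                int blocks_per_sm) {
    assert(mem_only >= 0.0 && math_only >= 0.0 && blocks_per_sm >= 1);
    const double m = mem_only / blocks_per_sm;
    const double c = math_only / blocks_per_sm;
    return m + c + (blocks_per_sm - 1) * std::max(m, c);
}
```

With one large threadblock per SM (k = 1) the model gives mem-only plus math-only, with no overlap at all. As k grows, the result approaches max{mem-only, math-only}: for mem-only = 8 and math-only = 4 it yields 12 at k = 1 and 9 at k = 4.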