Instruction Throughput - Nvidia

Latency: Analysis

Suspect latency issues if:
- Neither memory nor instruction throughput rates are close to HW theoretical rates
- Poor overlap between mem and math
  - Full-kernel time is significantly larger than max{mem-only, math-only}

Two possible causes:
- Insufficient concurrent threads per multiprocessor to hide latency
  - Occupancy too low
  - Too few threads in kernel launch to load the GPU
    - Indicator: elapsed time doesn't change if problem size is increased (and with it the number of blocks/threads)
- Too few concurrent threadblocks per SM when using __syncthreads()
  - __syncthreads() can prevent overlap between math and mem within the same threadblock

© NVIDIA Corporation 2011
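The two "suspect latency" checks above can be written down as a small helper. This is a minimal sketch, not part of the original slides: the function name `latency_indicators`, its parameters, and the thresholds `near_peak` and `overlap_slack` are all assumptions chosen for illustration, since the slide only says "close to" and "significantly larger" without giving numbers.

```python
def latency_indicators(mem_only, math_only, full,
                       mem_frac, instr_frac,
                       near_peak=0.7, overlap_slack=1.2):
    """Evaluate the two latency indicators from the slide.

    mem_only, math_only, full: measured times of the memory-only,
        math-only, and full kernel (same unit for all three).
    mem_frac, instr_frac: achieved memory and instruction throughput
        as fractions of the hardware theoretical rates (0.0 to 1.0).
    near_peak, overlap_slack: ASSUMED thresholds; the slide gives no
        numeric values for "close to" or "significantly larger".

    Returns (far_from_peak, poor_overlap).
    """
    # Indicator 1: neither throughput rate is close to the theoretical rate.
    far_from_peak = mem_frac < near_peak and instr_frac < near_peak
    # Indicator 2: full-kernel time is well above max{mem-only, math-only}.
    poor_overlap = full > overlap_slack * max(mem_only, math_only)
    return far_from_peak, poor_overlap
```

For example, `latency_indicators(1.0, 0.8, 1.7, 0.4, 0.3)` flags both indicators, which under these assumed thresholds would point toward a latency problem rather than a bandwidth or instruction limit.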
Simplified View of Latency and Syncs

[Figure: timeline (horizontal axis: time) comparing memory-only time, math-only time, and full-kernel time with one large threadblock per SM, for a kernel where most math cannot be executed until all data is loaded by the threadblock.]
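The effect in this figure can be approximated with a toy model. The sketch below is an assumption made for illustration and is not taken from the slides: it treats an SM as a two-stage pipeline in which each threadblock must finish its loads before its math starts, loads from different blocks run back to back, and the total work is split evenly across `blocks_per_sm` blocks. Real hardware scheduling is more complex, and the function name `full_kernel_time` is hypothetical.

```python
def full_kernel_time(mem_time, math_time, blocks_per_sm):
    """Toy estimate of full-kernel time on one SM.

    ASSUMED model: total memory and math work are split evenly over
    blocks_per_sm threadblocks; a block's math cannot start until that
    block's loads are done (the __syncthreads() barrier), but it may
    overlap with the loads of later blocks.
    """
    if blocks_per_sm < 1:
        raise ValueError("blocks_per_sm must be at least 1")
    mem_chunk = mem_time / blocks_per_sm
    math_chunk = math_time / blocks_per_sm
    mem_done = 0.0   # time at which the current block's loads finish
    math_done = 0.0  # time at which the current block's math finishes
    for _ in range(blocks_per_sm):
        mem_done += mem_chunk
        # Math waits for its own loads and for the previous block's math.
        math_done = max(mem_done, math_done) + math_chunk
    return math_done
```

With one large threadblock, `full_kernel_time(8.0, 4.0, 1)` gives the sum of the memory-only and math-only times, which is the no-overlap case the figure depicts. With four smaller blocks the same model gives a result close to the larger of the two times, which is why more concurrent threadblocks per SM can help when `__syncthreads()` is used.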
