Instruction Throughput - GPU Technology Conference

21.11.2014 Views
Latency: Analysis • Suspect latency issues if: • Neither memory nor instruction throughput rates are close to HW theoretical rates • Poor overlap between mem and math - Full-kernel time is significantly larger than max{mem-only, math-only} • Two possible causes: • Insufficient concurrent threads per multiprocessor to hide latency - Occupancy too low - Too few threads in kernel launch to load the GPU - Indicator: elapsed time doesn‟t change if problem size is increased (and with it the number of blocks/threads) • Too few concurrent threadblocks per SM when using __syncthreads() - __syncthreads() can prevent overlap between math and mem within the same threadblock © NVIDIA Corporation 2011 18

Simplified View of Latency and Syncs Memory-only time Math-only time Kernel where most math cannot be executed until all data is loaded by the threadblock Full-kernel time, one large threadblock per SM © NVIDIA Corporation 2011 time 19

Page 1 and 2: Instruction Limited Kernels CUDA Op

Page 3 and 4: Presentation Outline • Identifyin

Page 5 and 6: Limited by Bandwidth or Arithmetic?

Page 7 and 8: Optimizations for Instruction Throu

Page 9 and 10: Instruction Throughput: Analysis

Page 11 and 12: Serialization: Profiler Analysis

Page 13 and 14: Serialization: Analysis with Modifi

Page 15 and 16: Case Study: SMEM Bank Conflicts •

Page 17: Optimizations for Latency © NVIDIA

Page 21 and 22: Latency: Optimization • Insuffici

instruction

nvidia

corporation

profiler

throughput

threads

kernel

cuda

warp

latency

gputechconf.com

Instruction Throughput - GPU Technology Conference

Instruction Throughput - GPU Technology Conference ... View more Instruction Throughput - GPU Technology Conference

Delete template?

Save as template ?

Instruction Throughput - GPU Technology Conference Instruction Throughput - GPU Technology Conference