Instruction Throughput - GPU Technology Conference
Latency: Optimization
• Insufficient threads or workload:
• Best: increase the level of parallelism (more threads)
• Alternative: process several output elements per thread – yields more independent memory and arithmetic instructions (which get pipelined); downside: code complexity
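The multi-element-per-thread idea above can be sketched as a simple CUDA kernel. This is an illustrative example, not code from the talk; the kernel name, `ELEMS_PER_THREAD`, and the scaling operation are all assumptions chosen to keep the sketch minimal:

```cuda
// Each thread produces ELEMS_PER_THREAD outputs. The loads and
// multiplies for the different elements are independent of each other,
// so the hardware can pipeline them and hide latency with fewer threads.
#define ELEMS_PER_THREAD 4

__global__ void scale_multi(const float *in, float *out, float alpha, int n)
{
    // Each thread starts at its global index and then strides by the
    // total thread count to pick up its next independent element.
    int stride = blockDim.x * gridDim.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    #pragma unroll
    for (int e = 0; e < ELEMS_PER_THREAD; ++e, i += stride) {
        if (i < n)
            out[i] = alpha * in[i];  // independent per-element work
    }
}
```

With this layout the grid can be launched with roughly `n / ELEMS_PER_THREAD` threads instead of `n`, trading thread count for instruction-level parallelism within each thread – the trade-off the bullet describes.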
• Synchronization barriers:
• Can assess the impact on performance by commenting out __syncthreads()
- Incorrect results, but gives an upper bound on the possible improvement
• Try running several smaller threadblocks
- Less hogging of SMs; think of it as an SM “pipelining” blocks
- In some cases this costs extra bandwidth due to more halos
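The barrier-cost experiment above can be illustrated with a small shared-memory kernel. This is a hypothetical example (kernel name, tile size, and the smoothing weights are assumptions); the point is only to show which call one would comment out for the timing run:

```cuda
// Illustrative tiled smoothing kernel. Commenting out the
// __syncthreads() below makes the result incorrect (neighboring tile
// entries may not be written yet), but the runtime difference gives an
// upper bound on what reducing synchronization could ever save.
__global__ void smooth(const float *in, float *out, int n)
{
    __shared__ float tile[256];          // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard loads/stores rather than returning early: all threads in
    // the block must reach the barrier, or behavior is undefined.
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // comment out ONLY for the timing experiment

    if (i < n) {
        float l = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : tile[threadIdx.x];
        float r = (threadIdx.x < blockDim.x - 1) ? tile[threadIdx.x + 1] : tile[threadIdx.x];
        out[i] = 0.25f * l + 0.5f * tile[threadIdx.x] + 0.25f * r;
    }
}
```

Note that the guard on loads (instead of an early `return`) matters in the correct version too: every thread in the block must reach the barrier.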
• More information and tricks:
• Vasily Volkov, GTC 2010: “Better Performance at Lower Occupancy”
http://www.gputechconf.com/page/gtc-on-demand.html#session2238
© NVIDIA Corporation 2011