Instruction Throughput - Nvidia

Simplified View of Latency and Syncs

[Figure: timeline bars for a kernel where most math cannot be executed until all data is loaded by the threadblock. Axis: time. Bars: memory-only time; math-only time; full-kernel time, one large threadblock per SM; full-kernel time, two threadblocks per SM (each half the size of one large one).]

© NVIDIA Corporation 2011
Latency: Optimization

Insufficient threads or workload:
  - Best: increase the level of parallelism (more threads)
  - Alternative: process several output elements per thread. This gives more independent memory and arithmetic instructions (which get pipelined). Downside: code complexity

Synchronization barriers:
  - Can assess impact on perf by commenting out __syncthreads(). The result is incorrect, but this gives an upper bound on the improvement
  - Try running several smaller threadblocks
      - Less hogging of SMs; think of it as SM "pipelining" blocks
      - In some cases that costs extra bandwidth due to more halos

More information and tricks:
  - Vasily Volkov, GTC2010: "Better Performance at Lower Occupancy"
    http://www.gputechconf.com/page/gtc-on-demand.html#session2238
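
The "several output elements per thread" alternative can be sketched as below. This is only an illustration, not code from the slides: the SAXPY-style kernel, the names saxpy_multi and run_saxpy_multi, the ELEMS constant, and the 256-thread block size are all assumptions chosen for the example. Each thread issues all of its loads before any arithmetic, so the loads are independent of one another and can overlap in the memory pipeline.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Outputs computed per thread (assumed value; tune per kernel).
constexpr int ELEMS = 4;

// Hypothetical kernel: y[i] = a * x[i] + y[i], ELEMS outputs per thread.
__global__ void saxpy_multi(int n, float a, const float* x, float* y)
{
    // Each block covers blockDim.x * ELEMS consecutive elements.
    int base = blockIdx.x * blockDim.x * ELEMS + threadIdx.x;

    float xv[ELEMS] = {0}, yv[ELEMS] = {0};

    // Phase 1: issue every load up front. The loads do not depend on each
    // other, so several can be in flight at once.
    #pragma unroll
    for (int k = 0; k < ELEMS; ++k) {
        int i = base + k * blockDim.x;  // stride of blockDim.x keeps accesses coalesced
        if (i < n) { xv[k] = x[i]; yv[k] = y[i]; }
    }

    // Phase 2: independent arithmetic and stores.
    #pragma unroll
    for (int k = 0; k < ELEMS; ++k) {
        int i = base + k * blockDim.x;
        if (i < n) y[i] = a * xv[k] + yv[k];
    }
}

// Host helper: runs the kernel on n elements and returns the number of
// mismatches against a CPU reference, or -1 on a CUDA error.
int run_saxpy_multi(int n)
{
    const float a = 2.0f;
    std::vector<float> hx(n), hy(n), ref(n);
    for (int i = 0; i < n; ++i) {
        hx[i] = float(i % 7);           // small integers, so results are exact
        hy[i] = 1.0f;
        ref[i] = a * hx[i] + hy[i];
    }

    const size_t bytes = size_t(n) * sizeof(float);
    float *dx = nullptr, *dy = nullptr;
    if (cudaMalloc(&dx, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&dy, bytes) != cudaSuccess) { cudaFree(dx); return -1; }
    cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), bytes, cudaMemcpyHostToDevice);

    // ELEMS times fewer threads than a one-output-per-thread launch.
    const int threads = 256;
    const int blocks = (n + threads * ELEMS - 1) / (threads * ELEMS);
    saxpy_multi<<<blocks, threads>>>(n, a, dx, dy);

    cudaError_t err = cudaMemcpy(hy.data(), dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (hy[i] != ref[i]) ++mismatches;
    return mismatches;
}

int main()
{
    int bad = run_saxpy_multi(1 << 20);
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

The trade-off matches the one named on the slide: the two-phase structure and the bounds checks make the kernel harder to read than the one-element version, and the extra registers per thread can lower occupancy, so the best value of ELEMS has to be measured.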
