11.07.2015 Views

CUPTI User's Guide

CUPTI User's Guide

CUPTI User's Guide

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

Sampling EventsThe event API can also be used to sample event values while a kernel or kernels areexecuting (as demonstrated by the event_sampling sample). The sample shows onepossible way to perform the sampling. The event collection mode is set to<strong>CUPTI</strong>_EVENT_COLLECTION_MODE_CONTINUOUS so that the event counters run continuously.Two threads are used in event_sampling: one thread schedules the kernels and memcpysthat perform the computation, while another thread wakes periodically to sample an eventcounter. In this sample there is no correlation of the event samples with what is happeningon the GPU. To get some coarse correlation, you can use cuptiDeviceGetTimestamp tocollect the GPU timestamp at the time of the sample and also at other interesting pointsin your application.Interpreting Event ValuesThe tables below describe the events available for each device. Each event has a type thatindicates how the activity or action associated with that event is collected. The eventtypes are SM, TPC, and FB.SM Event TypeThe SM event type indicates that the event is collected for an action or activity thatoccurs on one or more of the device’s streaming multiprocessors (SMs). A streamingmultiprocessor creates, manages, schedules, and executes threads in groups of 32 threadscalled warps.The SM event values typically represent activity or action of thread warps, and not theactivity or action of individual threads. Details of how each event is incremented are givenin the event tables below.Two factors will impact the accuracy of the values collected for SM type events. First, dueto variations in system state, event values can vary across different, identical, runs of thesame application. Second, for devices with compute capability less than 2.0, SM events arecounted only for one SM. For devices with compute capability greater than 2.0, SM eventsfrom domain_d are counted for all SMs but for SM events from domain_a are countedfor multiple but not all, SMs. To get the most consistent results inspite of these factors, itis best to have number of blocks for each kernel launched to be a multiple of the totalnumber of SMs on a device. In other words, the grid configuration should be chosen suchthat the number of blocks launched on each SM is the same and also the amount of workof interest per block is the same.CUDA Tools SDK <strong>CUPTI</strong> User’s <strong>Guide</strong> DA-05679-001_v01 | 11

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!