Profile CUDA Settings

The NVIDIA Nsight Analysis tools contain a CUDA Profiler Activity that allows you to gather detailed performance information, in addition to timing and launch configuration details.

A CUDA Profiler activity consists of a kernel filter and a set of profiler experiments. Profiler experiments are directed analysis tests that collect in-depth performance information for an isolated instance of a kernel launch.

The CUDA Profiler allows the user to collect an arbitrary number of experiments per kernel launch. Each profile experiment may require the target kernel to be executed one or more times in order to collect all of the required data. In the following examples, the individual iterations of a profile experiment are referred to as Experiment Passes. Executing all passes of all experiments for a target kernel launch is handled transparently to the analyzed application.

NVIDIA Nsight employs a replay mechanism to execute experiments. Before any experiment passes run, a full snapshot of the mutable state of the target CUDA context is captured. For each experiment pass, the target kernel is executed once, and the saved mutable state is then restored. This effectively rewinds the CUDA context to the exact state before the kernel launch, so subsequent experiment passes are guaranteed to operate on the same, unchanged input data and CUDA state.
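
Conceptually, the replay scheme described above behaves like the following sketch. The helper names (captureMutableState, restoreMutableState, and so on) are hypothetical placeholders; the real mechanism is internal to NVIDIA Nsight and not exposed through any API.

    // Conceptual sketch only; all helpers are hypothetical stand-ins.
    struct ContextSnapshot {};  // device memory, module globals, heap state, ...

    ContextSnapshot captureMutableState()            { return ContextSnapshot(); }
    void restoreMutableState(const ContextSnapshot&) {}
    void launchTargetKernel()                        {}  // the kernel under analysis
    void collectCountersForPass(int /*pass*/)        {}

    void runExperiment(int requiredPasses)
    {
        ContextSnapshot snapshot = captureMutableState();  // one snapshot before any pass

        for (int pass = 0; pass < requiredPasses; ++pass)
        {
            launchTargetKernel();           // executes on unchanged input data
            collectCountersForPass(pass);   // gather this pass's portion of the data
            restoreMutableState(snapshot);  // rewind the CUDA context for the next pass
        }
    }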

As a result, the performance overhead introduced by profiling a single kernel launch is highly dependent on the following factors:

  • The number of enabled experiments and the number of passes each of them requires.
  • The amount of mutable CUDA context state that must be saved and restored between passes.
  • The execution time of the target kernel, which is incurred once per experiment pass.

Given these factors, profiling may incur a fairly large performance overhead to the target process. For that reason, it is recommended that the user limit kernel profiling to the specific kernels of interest, and also limit the enabled profiling experiments to only those that are actionable (with respect to the current code optimization efforts). This will help to maintain fast turnaround times.

 

To collect profile experiments:

  1. Open the Activity Document.
  2. Under Activity Type, select Profile CUDA Application.


  3. The Experiment Settings section of the Activity Document opens. This allows the user to select the subset of kernels that should be profiled, and configure the experiments to be collected.



    1. The Kernel Selection section allows the user to configure a filter that restricts the Kernels to Profile. The filter accepts regular expressions using Perl syntax. All launches whose kernel name matches the configured filter will be profiled. Specifically, the filter is matched against the kernel's non-mangled name, not the mangled name.
      If no filter is set, or the specified filter is invalid, all kernel launches for all kernels will be profiled. Because this can incur a high performance overhead, a warning icon is displayed.
    2. To profile only a certain number of kernels, select the checkbox After skipping N kernels, profile X kernels. This further restricts which kernel launches are analyzed in the final output. If a number is specified in the X field, only the first X kernel launches that match the kernel filter will be profiled; subsequent kernel launches will not be profiled, regardless of the result of the kernel filter. The counter for the capture limit is reset with each new capture session. For example, enter the following:

      After skipping 5 kernels, profile 80 kernels.

      In this scenario, the first 5 kernels will be skipped, then kernels 6 through 85 will be profiled.

      Note that the counters for skipping kernels and for limiting the profile session are applied after the Kernel RegEx Filter. In other words, kernel names that do not match the Kernel RegEx filter (if set) will not be counted toward the total number of kernels in the two fields (N kernels and X kernels) described above.

    3. The Profile Options allow the user to configure the following parameters of a profile activity:
      • Print Process Output to Console enables writing detailed information about the progress of the experiment data collection to the standard output (stdout).
      • Non-Overlapping Input/Output Buffers allows you to specify that none of the profiled kernels change the contents of their input buffers during execution, nor call device malloc/free or new/delete in a way that leaves the device heap in a different state. Specifically, a kernel may malloc and free a buffer within the same launch, but it must not perform an unmatched malloc or an unmatched free. If enabled, this option can vastly speed up experiment collection, as there is no need to save and restore the mutable state for each experiment pass.

        Note that if the option Non-Overlapping Input/Output Buffers is enabled mistakenly (that is, the profiled kernels do overwrite their input buffers or use unmatched device malloc/free), the behavior of the profiled application is completely undefined. As a consequence, the application might terminate abnormally, or the collected profile data may be invalid.

    4. Under Experiment Configuration, you can define the set of profile experiments to collect. There are two different ways to specify the Experiments to Run:
      • Experiment Templates are predefined groups of experiments that focus on a specific profiling task, or a certain field of interest. Upon selection of an experiment template, a short description (as well as the list of enclosed experiments) is displayed in the lower section of the activity configuration page.
      • Custom Experiment Configuration allows you to manually specify the list of experiments to execute. After selecting this option, the lower portion of the experiment settings changes to the Advanced Experiment configuration.



        As shown here, the list on the left includes the available experiment templates, in addition to all available individual experiments. Selecting an item from that list adds the experiment to the active experiments in the middle column.
        Note that some experiments can be added multiple times to the middle list, while others are only allowed to be added once. Specifically, it is possible to collect an arbitrary number of RAW counter experiments, all with different counter configurations.
        Each experiment in the middle section will be executed for every profiled kernel launch, as long as it is supported on the current GPU device architecture, as indicated in the device architecture columns.
        Based on the experiment selection in the middle table, the right document area will show a brief summary, including information such as the required experiment passes, the collected counter data, or the derived metrics. In addition, a few experiments expose further configuration options through the panel on the right. For example, the RAW counter experiments expose the list of available counters per target device.

 

To View Profiler Experiment Results

Profiler experiment results are displayed per CUDA launch.

  1. In the report, navigate to the CUDA Launches report page.
  2. Select a kernel launch in the table.
    1. Some experiments add columns to the launches table. These can be used to help sort and filter launches.
  3. In the correlation pane, expand the Experiment Results node.
  4. Under Experiment Results, select the results you would like to see.
  5. Many experiments have multiple detail panes. The tab selector can be used to switch between detail panes in an experiment.

Changing the selection in the CUDA Launches table will update both the correlation pane and the details pane.

Profiler Experiments

The following sections list all available experiments for NVIDIA Nsight 2.2 and describe the collected data, the derived metrics, and the visualizations used in the report pages.

Achieved Occupancy

Multiprocessor occupancy is the ratio of active warps to the maximum number of warps supported on a multiprocessor of the GPU. Theoretical occupancy is the maximum occupancy given the execution configuration. This experiment captures the achieved occupancy, which is the number of active warps per clock cycle divided by maximum warps per multiprocessor.

This experiment reports the achieved occupancy, the theoretical occupancy, and the limiting factors for the theoretical occupancy.

Note that higher occupancy does not always equate to higher performance; there is a point of diminishing returns, above which additional occupancy does not improve performance. However, low occupancy always interferes with the ability to hide memory latency, resulting in performance degradation.

The achieved occupancy may be lower than the theoretical occupancy for the following reasons:

  • Unbalanced workloads within thread blocks
  • Unbalanced workloads across thread blocks
  • Too few thread blocks launched to fully occupy all multiprocessors
  • A partially filled last wave of thread blocks

Replay Passes
  1 Pass: All Fermi and Kepler cards

Collected Data
  Active Warps (per SM): # warps that were active on the SM per cycle
  Active Cycles (per SM): # cycles in which at least one warp was active on the SM

 

The term SM refers to a Streaming Multiprocessor.

Metrics
  Achieved Occupancy: 100 * {Active Warps} / {Active Cycles} / {Max Warps per SM}
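
As a quick illustration of the metric, the same value can be recomputed offline from the collected counters; the counter values below are made up for the example.

    #include <cstdio>

    int main()
    {
        // Hypothetical counter values for one SM.
        double activeWarps   = 35500000.0;  // active warps, accumulated over all cycles
        double activeCycles  = 1000000.0;   // cycles with at least one active warp
        double maxWarpsPerSM = 48.0;        // device limit, e.g. compute capability 2.x

        // Achieved Occupancy = 100 * {Active Warps} / {Active Cycles} / {Max Warps per SM}
        double achievedOccupancy = 100.0 * activeWarps / activeCycles / maxWarpsPerSM;

        std::printf("Achieved occupancy: %.1f%%\n", achievedOccupancy);  // prints ~74.0%
        return 0;
    }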

Experiment Detail Pane

The CUDA Profiler Experiment results are reported on the CUDA Launches report page in the detail pane.

The achieved occupancy experiment detail panes are based on the Occupancy Calculator spreadsheet, which is included as part of the CUDA toolkit. Two visualization tabs are available: Occupancy Data and Occupancy Graphs.

The Occupancy Data tab details all key information that defines the occupancy for the profiled kernel launch. For each metric in the table, three different values may exist: the device limit, the theoretical value for the profiled launch, and the achieved value.

The relationship between all three variants of each metric is also displayed in a graph on the right side. The x-axis scale equals the device limit. The bar indicates the theoretical limit. The achieved value is indicated by a red mark, if applicable.

The Occupancy Graphs tab shows possible changes in the theoretical occupancy achieved by varying the execution configuration. Graphs are displayed for varying thread block sizes, registers per thread, and shared memory usage. Thread block size can be manipulated by changing the launch dimensions. Registers per thread can be manipulated by compiler or JIT settings or by changes to the code, and shared memory usage can be manipulated by code changes or by changing the dynamic shared memory launch parameter.
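
For example, register usage can be capped from the code side with __launch_bounds__, or for a whole compilation unit with the nvcc option --maxrregcount. The kernel below is a hypothetical illustration, not taken from the tool.

    // Hypothetical kernel: capping register usage can raise the theoretical
    // occupancy, possibly at the cost of register spills to local memory.
    #define BLOCK_SIZE 256

    __global__ void
    __launch_bounds__(BLOCK_SIZE, 6)   // at most 256 threads/block, request >= 6 resident blocks/SM
    scaleArray(float* data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    // Alternatively, limit registers for the whole compilation unit:
    //   nvcc --maxrregcount=20 kernel.cu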

The Achieved Occupancy graph on the right shows the achieved occupancy per streaming multiprocessor (SM), with the average across all SMs being equal to the value reported in the occupancy data table. High variation in achieved occupancy across the SMs may indicate that the kernel's thread blocks are too coarse-grained, causing imbalanced workloads; it is also often the result of flow control decisions that are based on the thread block index.



Achieved FLOPs

The Achieved FLOPs experiment provides statistics on the floating point operations of the kernel. For single-precision and double-precision, the execution count per instruction type as well as the operations per second (FLOPs) is reported.

Replay Passes
  2 Passes: All Fermi and Kepler cards

Collected Data
  Executed Floating Point Operations (per Kernel): Number of floating point operations executed. Fused multiply-add (FMA) and reciprocal square root (RSQ) account for two operations. All other instructions count as one operation.
  Elapsed Time: Elapsed time for the kernel execution

Metrics
  FLOPs: {Executed Floating Point Operations} / {Elapsed Time}
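
As a rough illustration of how operations are counted, consider the hypothetical kernel below (the kernel name and code are made up for this example).

    // Hypothetical kernel illustrating how floating point operations are counted.
    __global__ void axpyPlusBias(const float* x, const float* y, float* out,
                                 float a, float bias, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
        {
            float t = a * x[i] + y[i];  // typically compiles to a single FMA: counts as 2 operations
            out[i]  = t + bias;         // a plain FADD: counts as 1 operation
        }
    }
    // Roughly 3 single-precision operations per active thread; dividing the total
    // by the elapsed time yields the reported single-precision FLOPs value.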

Experiment Details Pane

The left chart shows the total count of floating point operations executed during the kernel execution for each of the tracked groups of operations. On the right, the achieved single-precision floating point operations per second and its double-precision counterpart are shown.


 

Instruction Statistics

The Instruction Statistics experiment provides key metrics for evaluating how efficiently the kernel's instructions execute on the target device, including instructions per clock cycle (IPC), instruction serialization, SM activity, and instructions per warp (IPW).

Instructions Per Cycle

For devices of compute capability 2.x, a multiprocessor consists of:

  • 32 CUDA cores for arithmetic operations
  • 4 special function units for single-precision floating-point transcendental functions
  • 2 warp schedulers

The first scheduler is in charge of the warps with an odd ID, and the second scheduler is in charge of the warps with an even ID.

When a scheduler issues a double-precision floating-point instruction, the other scheduler cannot issue any instruction.

A warp scheduler can issue an instruction to only half of the CUDA cores. To execute an instruction for all threads of a warp, a warp scheduler must therefore issue the instruction over two clock cycles for an integer or floating-point arithmetic instruction.

Instructions Executed is the sum of the instructions issued by the two schedulers, not including replays.

Instructions Issued is the sum of the instructions issued by the two schedulers, including replays. Instruction replays only occur for memory operations. Shared memory loads/stores require replays for each bank conflict. Global and local memory replays occur if a request accesses multiple cache lines, or the request size per thread is greater than 32 bits.

The Executed IPC and Issued IPC metrics are a measure of instruction throughput while the multiprocessor is active. Instruction Serialization is the ratio of replayed instructions to issued instructions, expressed as a percentage. It can be used to understand how memory operations affect the instruction issue rate.
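
A short worked example with made-up counter values shows how these metrics relate:

    #include <cstdio>

    int main()
    {
        // Hypothetical per-SM counter values.
        double instructionsExecuted = 800000.0;   // not including replays
        double instructionsIssued   = 1000000.0;  // including replays
        double activeCycles         = 650000.0;

        double executedIPC   = instructionsExecuted / activeCycles;        // ~1.23
        double issuedIPC     = instructionsIssued   / activeCycles;        // ~1.54
        double serialization = (instructionsIssued - instructionsExecuted)
                               / instructionsIssued;                       // 20% of issues are replays

        std::printf("Executed IPC: %.2f  Issued IPC: %.2f  Serialization: %.0f%%\n",
                    executedIPC, issuedIPC, serialization * 100.0);
        return 0;
    }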

Load Balancing

For devices of compute capability 2.x, the global work distribution engine schedules thread blocks to multiprocessors. Each thread block is 1 to 32 warps. The multiprocessor Active Cycles counter is the number of clock cycles the multiprocessor was managing at least one warp.

The Active Cycles counter can be used to identify load balancing problems caused by inconsistent work per thread or by an insufficient number of blocks in a launch. If the program executes the kernel concurrently with other kernels, overall multiprocessor load balancing is less critical; in that case, use the Active Cycles counter per multiprocessor to determine the major influence on the kernel's duration.

The Warps Launched counter can be used to determine if you have equal work assigned to each SM. Warps launched can be converted to Thread Blocks Launched (also known as CTAs launched) by the following formula:

Thread Blocks Launched = Warps Launched / ((Threads Per Block + (WARP_SIZE - 1)) / WARP_SIZE)
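
Assuming the conversion above, a quick worked example (all counter values are made up):

    #include <cstdio>

    int main()
    {
        const int WARP_SIZE   = 32;
        int threadsPerBlock   = 96;   // launch configuration of the kernel
        int warpsLaunchedOnSM = 12;   // Warps Launched counter for one SM (hypothetical)

        // Warps per thread block, rounded up to whole warps.
        int warpsPerBlock = (threadsPerBlock + (WARP_SIZE - 1)) / WARP_SIZE;  // 3

        int threadBlocksLaunched = warpsLaunchedOnSM / warpsPerBlock;         // 4 blocks on this SM

        std::printf("Thread blocks launched on this SM: %d\n", threadBlocksLaunched);
        return 0;
    }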

Replay Passes
  1 Pass: GF100, GF110, all Kepler cards
  2 Passes: GF10x, GF11x

Collected Data
  Active Cycles (per SM): # cycles the SM had at least one active warp
  Instructions Executed (per SM): # instructions executed, not including replays
  Instructions Issued (per SM): # instructions issued, including replays
  Warps Launched (per SM): # warps launched on an SM
  Elapsed Clock: Elapsed clock ticks for the kernel execution

Metrics
  Issued IPC: {Instructions Issued} / {Active Cycles}
  Executed IPC: {Instructions Executed} / {Active Cycles}
  Instruction Serialization: ({Instructions Issued} - {Instructions Executed}) / {Instructions Issued}
  SM Activity: {Active Cycles} / {Elapsed Clock}
  IPW: {Instructions Executed} / {Warps Launched}

Experiment Details Pane

The CUDA Instruction Statistics detail pane contains five charts:

  1. Instructions Per Clock
    This chart is a z-ordered column graph that shows the Instructions Executed and Instructions Issued metrics.
    Instructions Issued takes all replayed instructions into account and is consequently always equal to or greater than the number of executed instructions.
    The theoretical maximum IPC is defined by the target GPU device only. The y-axis of the graph scales according to this theoretical maximum.
    Common reasons for obtaining low executed IPC include:
  2. Instruction Serialization
    This chart further details the difference between executed instructions and issued instructions. It denotes the ratio of replayed instructions per issued operation. High serialization typically indicates bad memory access patterns.
  3. SM Activity
    This chart shows the percentage of time each multiprocessor was active during the launch duration. An SM is considered to be active if at least one warp is currently assigned for execution.
    An SM can be inactive, even though the kernel grid is not yet completed, due to high workload imbalances. Such load balancing issues between the SMs can be caused by a few factors: different execution times for the kernel blocks, variations between the number of scheduled blocks per SM, or a combination of the two.
  4. Instructions Per Warp
    This chart shows the average instructions per warp for each SM and for the kernel. This metric can be used to spot variations in instructions executed per warp. High variations in this metric are typically caused by warps following different flow control paths through the kernel code.
  5. Warps Launched
    This chart shows the warps launched per SM, and the average warps launched per SM. Large differences in the number of warps executed per SM are most commonly the result of providing an insufficient amount of parallelism with the kernel grid. More specifically, the number of kernel blocks is too low to make good use of all available compute resources. A high variation in warps launched is only a concern if SM activity is low on one or more SMs. In this case, you should consider a different method for partitioning your workload that either results in less variance in duration per warp or the execution of many more thread blocks, so that the work distributor can dispatch work more evenly.

 

Issue Efficiency

Issue Efficiency provides information about the device's ability to issue the instructions of the kernel. The data reported includes execution dependencies, eligible warps, and SM stall reasons.

For devices of compute capability 2.x, a multiprocessor has two warp schedulers. Each warp scheduler manages at most 24 warps, for a total of 48 warps per multiprocessor. The kernel execution configuration may reduce the runtime limit. For information on occupancy, see the Achieved Occupancy experiment. The first scheduler is in charge of the warps with an odd ID, and the second scheduler is in charge of warps with an even ID.

At every instruction issue time, each scheduler will pick an eligible warp from its list of active warps and issue an instruction. A warp is eligible if the instruction has been fetched, the execution unit required by the instruction is available, and the instruction has no dependencies that have not been met.

The schedulers report the following statistics on the warps in the multiprocessor:

Replay Passes
  29 Passes: All Fermi cards
  7 Passes: All Kepler cards

Collected Data
  Active Warps (per SM): # warps that were active on the SM per cycle
  Active Cycles (per SM): # cycles the kernel was active on the SM
  Eligible Warps (per SM): # active warps that were eligible to issue an instruction per cycle
  Zero Eligible Warps (per GPU): # cycles in which no warp was eligible to issue an instruction
  One Eligible Warp (per GPU): # cycles in which only a single warp was eligible to issue an instruction
  Warp Issue Holes (per SM): # warps not eligible to issue an instruction
  Long Warp Issue Holes (per SM): # warps not eligible to issue an instruction for more than 32 cycles
  Instructions Executed (per SM): # instructions executed, not including replays
  Issue Stall Reasons (per GPU): # warps that were not eligible to issue an instruction, due to one of the following reasons: Instruction Fetch, Execution Dependency, Synchronization, Data Requests, Texture, or Other
Experiment Detail Pane

The experiment detail pane groups information into three tabs: 

1. Eligible Warps Tab

Among all of the warps currently assigned to an SM, only a few may be eligible to issue their next instruction. Ineligible warps may have to wait for the completion of previously issued instructions, may require device resources that are currently unavailable, or may have to wait for all other warps to reach a synchronization point. Having no eligible warp means that no new instruction can be issued in that cycle, which ultimately lowers the achieved IPC.

Some GPU devices can issue multiple instructions per cycle. For such devices, having only one eligible warp is not sufficient to utilize all available compute resources. Hence, it is recommended to aim for at least two eligible warps at any given time during the kernel execution.

One prerequisite of having enough eligible warps is to provide a sufficient number of Active Warps per Active Cycle on each SM. The longer each warp stays in a non-eligible state, the more active warps are necessary to hide the latency until a warp becomes eligible again.

The ratio of the active warps per cycle to the maximum number of active warps per SM is also referred to as achieved occupancy. Further details can be found in the Achieved Occupancy experiment.

For each SM, the number of active warps is an upper bound on the number of eligible warps. It is recommended to have at least two warps eligible at all times; however, increasing this number further typically has no additional effect on the performance of executing the kernel.

2. Issue Dependencies Tab

Instruction dependencies can keep warps in an ineligible state for long periods of time. In other words, in order for an instruction to be executable, all of its inputs need to be fully calculated and readily available to the multiprocessor. If this is not the case, the warp cannot issue this instruction and needs to wait until the dependencies are resolved.

The Execution Dependency Cycles provide an estimate of how many cycles, on average, each instruction needs to wait until all of its dependencies are fully resolved. The distribution between short and long dependency cycles provides further insight into how often a warp was stalled for more than 32 cycles. These long dependencies are typically caused by dependencies on global memory instructions or texture operations, or by stalls at barriers.

Derived from the dependency cycles and the active warps, the Max Dependency IPC defines an upper bound for the kernel's achievable instruction throughput. Closely related is the Max Dependency Utilization, which describes the ratio of the achieved IPC versus the Max Dependency IPC.



3. Issue Stalls Tab

The Warp Issue Efficiency shows the overall ratios (averaged across all SMs) of having (a) no eligible warp, (b) only a single eligible warp, or (c) two or more eligible warps, across all active cycles of a kernel execution. Having fewer than two eligible warps can cause some (or all) schedulers to fail to issue an instruction.

Optimizations based on the stall reasons should only be pursued if the percentage of cycles in which zero or one warp is eligible to issue is large. If this percentage is low, the schedulers already have sufficient eligible warps per cycle to choose from.

 

 

Branch Statistics

Branch Statistics provide insight into the kernel's usage of flow control, such as the ratio of uniform versus diverged branches, the number of branches per warp, and the flow control efficiency.

Replay Passes
  3 Passes: GF100, GF110
  6 Passes: GF10x, GF11x
  2 Passes: All Kepler cards

Collected Data
  Branches (per SM): # branch instructions executed
  Taken Branches (per SM): # branches taken by at least one thread
  Diverged Branches (per SM): # branches that were evaluated differently across a warp
  Warps Launched (per SM): # warps launched on an SM
  Instructions Executed (per SM): # instructions executed, not including replays
  Thread Instructions Executed (per SM): Sum of active threads over all executed instructions

 

Note that the semantics of the branch counters differ slightly between the Fermi and Kepler architectures.

For example, on Fermi, every EXIT instruction counts as a branch. However, on Kepler, non-diverging EXIT instructions are no longer considered a branch. Consequently, the results of the branch experiments might differ between the two architectures, even when the executed assembly code is the same.

Metrics
  Branch Efficiency: ({Branches} - {Diverged Branches}) / {Branches}
  Control Flow Efficiency: {Thread Instructions Executed} / {Instructions Executed} / {Warp Size}
  Branch Condition (false): {Branches} - {Taken Branches}
  Branch Condition (diverged): {Diverged Branches}
  Branch Condition (true): {Taken Branches} - {Diverged Branches}
Experiment Details Pane

Flow control can have a serious impact on the efficiency of executing a kernel. This is even more likely if many flow control decisions are divergent, forcing the threads of a warp onto highly different control paths throughout the kernel. Therefore, it is important to understand how often flow control decisions are made, how many of them are divergent, and how much they affect overall performance.

Branch Efficiency is the ratio of uniform flow control decisions to all conditionals executed. The lower this metric, the more often warps split their control flow, which may lower overall execution performance.

Control Flow Efficiency provides an estimate of an upper bound on the usage of all available compute resources, based on the executed flow control. That is, the lower this efficiency value becomes, the more compute resources are left idle while executing the kernel.

There are many ways a conditional expression written in CUDA C can be translated into final assembly code, and not all possible variants of a conditional operation can be accounted for with this metric. Hence, a low control flow efficiency definitely indicates a performance issue; the reverse, however, is not necessarily true, and a kernel exposing a close-to-optimal control flow efficiency might still be impacted by flow control.

To further look into the distribution of uniform versus divergent flow control decisions, the Branches Per Warp and Branch Condition metrics provide the total counts of conditionals made, along with their respective percentage ratios. Both metrics are provided for each of the three outcomes of an assembly conditional: the branch was not taken, the branch diverged due to an ambiguous evaluation of the conditional across the threads of a warp, or the branch was taken.
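
The difference between a uniform and a diverged branch can be illustrated with two hypothetical kernels (not part of the tool):

    // Hypothetical kernels illustrating uniform versus divergent branches.
    __global__ void uniformBranch(float* data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // All threads of a warp evaluate this condition identically
        // (blockIdx.x is the same for the whole block): a uniform branch.
        if (blockIdx.x % 2 == 0)
            data[i] += 1.0f;
    }

    __global__ void divergedBranch(float* data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Odd and even threads of the same warp take different paths:
        // the branch diverges, both paths are executed serially, and
        // control flow efficiency drops.
        if (threadIdx.x % 2 == 0)
            data[i] += 1.0f;
        else
            data[i] -= 1.0f;
    }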

 

Memory Statistics

The memory statistics experiment consists of various sub-experiments that contribute to a combined in-depth view of a kernel’s usage of the CUDA memory hierarchy. Having individual tests, rather than one monolithic experiment, allows limiting the data collection to specific memory spaces only. For overall performance reasons, it is recommended to enable only the sub-experiments covering the memory spaces that are being used by the kernel program.

Experiment Details Pane

If at least one memory sub-experiment was collected, the overview tab of the memory experiment becomes available. In addition, each individual sub-experiment will also provide its own custom visualization tab. For all tabs, the data is presented in a graphical representation, as well as in tabular form.

The overview page provides a high-level view of all memory transactions made by the kernel. The chart represents the local view of the CUDA memory hierarchy, with the kernel's requests coming in on the left.

The individual visualization tabs of the sub-experiments provide further details and additional metrics for a specific memory space. Where applicable, these pages provide the number of Transactions Per Request necessary to complete the memory operation. Depending on the memory access pattern and the number of threads actively participating in a memory request, one or multiple transactions might be required to complete the operation. Each transaction causes the memory operation to be issued once. In the case of a high number of transactions per request, try optimizing your memory access patterns to make them more coherent and better aligned with the cache lines.
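
For example, on a Fermi-class device with 128-byte cache lines, a warp reading consecutive 4-byte values can usually be served by a single transaction, whereas a large stride spreads the same request over many lines. The kernels below are hypothetical and only illustrate the two patterns.

    // Hypothetical kernels contrasting memory access patterns.
    __global__ void coalescedRead(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];           // warp touches one contiguous 128-byte range:
                                      // typically 1 transaction per request
    }

    __global__ void stridedRead(const float* in, float* out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i * stride];  // with a large stride each thread hits a different
                                      // cache line: up to 32 transactions per request
    }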

 

RAW Counters

The RAW Counter experiment provides the ability to collect and analyze performance counter data gathered at runtime on the device. The configuration panel on the activity page allows specifying the set of counters to be collected for the targeted GPU device.

Two variants of the RAW Counter experiment exist. The RAW Counters – Tesla Architecture experiment handles all cards based on the Tesla architecture, while the Fermi and Kepler architectures are both covered by the RAW Counters – Fermi/Kepler experiment.




As only a limited number of performance counters can be collected at a time, the list of counters available for selection is restricted based on the counters that are already selected. However, multiple instances of the RAW counter experiment can be added to a profile activity. This effectively allows you to collect a broad set of performance counters, as well as to configure different sets of counters per target device.

Note that multiple instances of the RAW counter experiment are collected using multiple replayed launches of the target kernel. Even though the replay passes operate on the exact same input data, there might be observable variations in execution between them. As a consequence, RAW counters covering only a subset of the GPU device might show variations across the replay passes. If a metric requires input values from multiple replay passes, it is recommended that you sample at least one counter across all involved experiment passes and use this counter to normalize the counter values across the experiments.
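
As a sketch of this normalization (all values below are made up), a counter sampled in both passes can be used to rescale a counter from one pass before combining it with a counter from another pass:

    #include <cstdio>

    int main()
    {
        // Hypothetical raw counter values from two replay passes of the same kernel.
        // 'instructions issued' was sampled in both passes and serves as the normalizer.
        double passA_counter    = 120000.0;   // some per-unit counter from pass A
        double passA_normalizer = 1000000.0;  // instructions issued, pass A
        double passB_counter    = 64000.0;    // some per-unit counter from pass B
        double passB_normalizer = 980000.0;   // instructions issued, pass B

        // Scale pass B's counter as if it had been collected during pass A's execution.
        double passB_scaled = passB_counter * (passA_normalizer / passB_normalizer);

        double combinedRatio = passA_counter / passB_scaled;  // metric combining both passes
        std::printf("Combined metric: %.3f\n", combinedRatio);
        return 0;
    }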

Replay Passes
  1 Pass: All Tesla cards
  1 Pass: All Fermi and Kepler cards (if Collect from all units is disabled)
  1-8 Passes: All Fermi and Kepler cards (if Collect from all units is enabled)


Profiler Triggers

Eight profiler trigger counters are available, numbered 00 to 07. To use profiler triggers, add the following statement to the kernel code:

__prof_trigger(n);

where n is the index of the trigger counter: a whole number from 0 up to 7.
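
A minimal sketch of instrumenting a kernel with a profiler trigger follows; the kernel itself is hypothetical.

    // Hypothetical kernel: increment profiler trigger counter 0 for each warp
    // that takes the clamping path, to measure how often that path is entered.
    __global__ void clampValues(float* data, float limit, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && data[i] > limit)
        {
            __prof_trigger(0);   // bumps trigger counter 0 (one of the 00..07 counters)
            data[i] = limit;
        }
    }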

 

Report Page

The results of the RAW Counter experiments are reported in two places:

For Fermi and Kepler architectures, the columns in the output include: 

For the Tesla architecture, the columns in the output include: 

 

 

 

 

