CUDA Experiments

NVIDIA® Nsight™ Development Platform, Visual Studio Edition 3.2 User Guide
The NVIDIA Nsight Analysis tools contain a CUDA Profiler Activity that allows you to gather detailed performance information, in addition to timing and launch configuration details.

A CUDA Profiler activity consists of a kernel filter and a set of profiler experiments. Profile experiments are directed analysis tests targeted at collecting in-depth performance information for an isolated instance of a kernel launch.

The CUDA Profiler allows the user to collect an arbitrary number of experiments per kernel launch. Each profile experiment may require the target kernel to be executed one or more times in order to collect all of the required data. In the following examples, the individual iterations of a profile experiment are referred to as Experiment Passes. Executing all passes of all experiments for a target kernel launch is handled transparently to the analyzed application.

NVIDIA Nsight employs a replay mechanism for executing experiments. Before the experiment passes are executed, a full snapshot of the mutable state of the target CUDA context is captured. For each experiment pass, the target kernel is executed once, and the saved mutable state is then restored. Effectively, this rewinds the CUDA context to the exact state before the kernel launch, so subsequent experiment passes are guaranteed to operate on the same, unchanged input data and CUDA state.

Note that the NVIDIA Nsight CUDA Profiler attempts to execute the CUDA kernel transparently to the application. Problems may arise if either (1) the application has a hard time limit for the kernel or for other interaction on the thread that launches the kernel, or (2) the kernel uses an IPC mechanism with the CPU that results in state changes on the CPU side.
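
The replay loop can be pictured with a small host-side simulation. This is only a sketch in Python, not Nsight's implementation: the dictionary `state` stands in for the mutable state of the CUDA context, and `run_experiment_passes` and `scale_in_place` are hypothetical names invented for illustration.

```python
import copy

def run_experiment_passes(state, kernel, num_passes):
    """Simulate replay: run `kernel` once per pass, then restore the saved state."""
    snapshot = copy.deepcopy(state)          # capture the mutable state once
    outputs = []
    for _ in range(num_passes):
        kernel(state)                        # the kernel may modify state in place
        outputs.append(copy.deepcopy(state)) # record what this pass produced
        state.clear()
        state.update(copy.deepcopy(snapshot))  # rewind to the pre-launch state
    return outputs

def scale_in_place(state):
    # Hypothetical kernel that overwrites its own input buffer.
    state["buf"] = [2 * x for x in state["buf"]]
```

Because the state is rewound after every pass, each pass in this sketch sees the same input even though the example kernel overwrites its buffer. How the real tool leaves the final kernel results visible to the application is not modeled here.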

Because of this replay mechanism, the performance overhead introduced by profiling a single kernel launch depends heavily on the following factors:

- The number of enabled profile experiments, or more specifically, the resulting number of experiment passes to be executed.
- The execution time of the target kernel on the GPU. Each experiment pass launches the kernel exactly once. Some experiment passes modify the kernel binary, which increases the execution time.
- The overall size of the mutable state of the CUDA context at the time a kernel is profiled. The mutable state includes all allocated memory that is accessible from the kernel, in both GPU memory and host system memory.
- The target system's overall performance when copying, saving, and restoring the mutable device memory, which may include transferring all device memory over the PCI bus to the host system.

Given these factors, profiling may incur a fairly large performance overhead to the target process. For that reason, it is recommended that the user limit kernel profiling to the specific kernels of interest, and also limit the enabled profiling experiments to only those that are actionable (with respect to the current code optimization efforts). This will help to maintain fast turnaround times.
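
To get a feeling for how these factors combine, the following back-of-the-envelope model may help. It is a rough sketch under loudly stated assumptions, not a formula from NVIDIA: it assumes that saving and restoring the state each cost one full copy at a fixed bandwidth, that all costs simply add up, and that the kernel time is the same in every pass. The function name and parameters are hypothetical.

```python
def estimate_profiling_time(num_passes, kernel_time_s, state_bytes,
                            copy_bandwidth_bytes_per_s):
    """Rough estimate (seconds) of profiling one kernel launch.

    Assumed model: one save of the mutable state up front, then for
    every experiment pass one kernel execution plus one state restore.
    """
    save_time_s = state_bytes / copy_bandwidth_bytes_per_s
    restore_time_s = save_time_s  # assumption: restore costs as much as save
    return save_time_s + num_passes * (kernel_time_s + restore_time_s)
```

For example, under this model a 2 ms kernel with 1 GB of mutable state, a 5 GB/s copy rate, and 10 experiment passes comes out at about 2.2 seconds, almost all of it spent copying state rather than executing the kernel.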

To collect profile experiments:

Open the Activity Document.
Under Activity Type, select Profile CUDA Application.

The Experiment Settings section of the Activity Document opens. This allows the user to select the subset of kernels that should be profiled, and configure the experiments to be collected.

The Kernel Selection section allows the user to configure a filter to restrict the Kernels to Profile. This filter accepts regular expressions using Perl syntax. All launches for which the kernel name matches the configured filter will be profiled. Specifically, the filter is matched against the kernel's non-mangled name, not its mangled name.
If no filter is set or the specified filter is invalid, all launches of all kernels will be profiled. Because this can cause a high performance overhead, a warning icon will be displayed.
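
The filter rule can be sketched as follows. This is an illustration only: `should_profile` is a hypothetical helper, Python's `re` module is used as a stand-in for the Perl-syntax engine (the two dialects are similar but not identical), and the use of a partial match (`search`) rather than a full match is an assumption, since the text does not say which one the tool applies.

```python
import re

def should_profile(kernel_name, filter_pattern):
    """Decide whether a launch of `kernel_name` (non-mangled) is profiled."""
    if not filter_pattern:
        return True              # no filter set: profile every launch
    try:
        regex = re.compile(filter_pattern)
    except re.error:
        return True              # invalid filter: profile every launch
    # Assumption: a match anywhere in the name counts as a match.
    return regex.search(kernel_name) is not None
```

With this sketch, `should_profile("matrixMul", "^matrix")` is true, `should_profile("vectorAdd", "^matrix")` is false, and an invalid pattern such as `"("` falls back to profiling everything.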

To profile only a certain number of kernel launches, select the checkbox After skipping N kernels, profile X kernels. This allows you to further customize the kernels that are analyzed in the final output. If you specify a number in the N field, the first N kernel launches that match the kernel filter are skipped. If you specify a number in the X field, the next X kernel launches that match the kernel filter will be profiled. Subsequent kernel launches will not be profiled, regardless of the result of the kernel filter. The counter for the capture limit is reset with each new capture session. For example, enter the following:

After skipping 5 kernels, profile 80 kernels.

In this scenario, the first 5 kernels will be skipped, then kernels 6 through 85 will be profiled.

Note that the counters for skipping kernels and for limiting the profile session are applied after the Kernel RegEx Filter. In other words, kernel names that do not match the Kernel RegEx filter (if set) will not be counted toward the total number of kernels in the two fields (N kernels and X kernels) described above.
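
The interaction between the filter and the two counters can be sketched like this. The function `select_launches` is hypothetical and assumes a valid filter; as before, Python's `re` stands in for the Perl-syntax engine and a partial match is assumed.

```python
import re

def select_launches(launch_names, filter_pattern, skip_n, profile_x):
    """Return the 0-based indices of the launches that would be profiled."""
    regex = re.compile(filter_pattern) if filter_pattern else None
    matched = 0      # counts only launches that pass the filter
    selected = []
    for index, name in enumerate(launch_names):
        if regex is not None and not regex.search(name):
            continue  # non-matching launches do not advance the counters
        matched += 1
        if skip_n < matched <= skip_n + profile_x:
            selected.append(index)
    return selected
```

For the launch sequence `["copy", "mul", "copy", "mul", "mul", "mul"]` with the filter `mul`, N = 1, and X = 2, this returns `[3, 4]`: the `copy` launches are never counted, the first `mul` launch is skipped, and the next two are profiled.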

The Profile Options allow the user to configure the following parameters of a profile activity:

Print Process Output to Console enables writing detailed information about the progress of the experiment data collection to the standard output (stdout).

Non-Overlapping Input/Output Buffers allows you to specify that the profiled kernels neither change the contents of their input buffers during execution nor call device malloc/free or new/delete in a way that leaves the device heap in a different state. Specifically, a kernel can malloc and free a buffer in the same launch, but it cannot call an unmatched malloc or an unmatched free. If enabled, this option can vastly speed up experiment collection, because there is no need to save and restore the mutable state for each experiment pass.

Note that if the option Non-Overlapping Input/Output Buffers is enabled mistakenly (that is, the profiled kernels do overwrite their input buffers or use unmatched device malloc/free), the behavior of the profiled application is completely undefined. As a consequence, the application might terminate abnormally, or the collected profile data may be invalid.
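
The following host-side simulation illustrates why the option is only safe for kernels that leave their inputs untouched. It is a sketch with hypothetical names, not CUDA code: each "kernel" is a plain Python function, and `run_passes_without_restore` mimics repeated experiment passes with the save/restore step omitted.

```python
def saxpy_separate_output(buffers):
    # Reads x and y, writes only to a separate output buffer.
    buffers["out"] = [2.0 * x + y for x, y in zip(buffers["x"], buffers["y"])]
    return buffers["out"]

def saxpy_overwrites_input(buffers):
    # Writes the result back into y, which is also an input.
    buffers["y"] = [2.0 * x + y for x, y in zip(buffers["x"], buffers["y"])]
    return buffers["y"]

def run_passes_without_restore(kernel, buffers, num_passes):
    """Run the kernel repeatedly with no state restore between passes."""
    return [list(kernel(buffers)) for _ in range(num_passes)]
```

With `x = [1.0, 2.0]` and `y = [10.0, 20.0]`, three passes of `saxpy_separate_output` all produce `[12.0, 24.0]`, whereas `saxpy_overwrites_input` produces `[12.0, 24.0]`, then `[14.0, 28.0]`, then `[16.0, 32.0]`, because each pass consumes the output of the previous one.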

Under Experiment Configuration, you can define the set of profile experiments to collect. There are two different ways to specify the Experiments to Run:
- Experiment Templates are predefined groups of experiments that focus on a specific profiling task, or a certain field of interest. Upon selection of an experiment template, a short description (as well as the list of enclosed experiments) is displayed in the lower section of the activity configuration page.
- Custom Experiment Configuration allows you to manually specify the list of experiments to execute. After selecting this option, the lower portion of the experiment settings changes to the Advanced Experiment configuration.
  
  On the left is a list that includes the available experiment templates, in addition to all available individual experiments. Selecting an item from that list adds the experiment to the active experiments in the middle column.
  Note that some experiments can be added multiple times to the middle list, while others are only allowed to be added once. Specifically, it is possible to collect an arbitrary number of RAW counter experiments, all with different counter configurations.
  Each experiment in the middle section will be executed for each profiled kernel launch, as long as it is supported on the current GPU device architecture. Support is indicated in the device architecture columns.
  Based on the experiment selection in the middle table, the right document area will show a brief summary, including information such as the required experiment passes, the collected counter data, or the derived metrics. In addition, a few experiments expose further configuration options through the panel on the right. For example, the RAW counter experiments expose the list of available counters per target device.

To View Profiler Experiment Results

Profiler experiment results are displayed per CUDA launch.

In the report, navigate to the CUDA Launches report page.
Select a kernel launch in the table.
- Some experiments add columns to the launches table. These can be used to help sort and filter launches.
In the correlation pane, expand the Experiment Results node.
Under Experiment Results, select the results you would like to see.
Many experiments have multiple detail panes. The tab selector can be used to switch between detail panes in an experiment.

Changing the selection in the CUDA Launches table will update both the correlation pane and the details pane.