NVIDIA® Parallel Nsight 1.5 Release Notes

Important information about the NVIDIA Parallel Nsight™ 1.5 September 2010 RC release

 

Parallel Nsight requires a license
This release of Parallel Nsight requires a valid license key to function. The NVIDIA Parallel Nsight™ 1.5 September 2010 RC release requires either a Standard (no cost) or a Professional ($349/seat) license. See www.nvidia.com/ParallelNsight for more information on obtaining either type of license.

 

Visual Studio 2010
Parallel Nsight 1.5 introduces compatibility with Visual Studio 2010. You can debug, analyze, and profile your applications in Visual Studio 2010 or Visual Studio 2008.

 

Display Driver
You must install the NVIDIA display driver that supports the Parallel Nsight tools. If you have an NVIDIA graphics card installed on your target machine, you probably already have an NVIDIA display driver. However, Parallel Nsight requires a specific version of the driver in order to function properly. From the NVIDIA web site, download and install the following display driver:

Release 260, Version 260.61

 

You can obtain the required driver from the same location where you obtained this September 2010 RC release.

 

NTFS File System

The Parallel Nsight Host software requires an NTFS file system. Parallel Nsight does not work with a file system based on the file allocation table (FAT) format, so you cannot install Parallel Nsight on a FAT or FAT32 system. The reason is that the Parallel Nsight Host software uses junction points, which are not supported on FAT systems. Make sure to install the Parallel Nsight Host software on an NTFS file system. If you set up remote debugging by installing the Parallel Nsight Monitor on a separate machine (the target machine), the target machine can use FAT, FAT32, or NTFS.

 

See below for more release information about:

CUDA Debugger

Graphics Inspector and Graphics Debugger

Analysis Tools

 

CUDA Debugger

New In The 1.5 September 2010 RC Release (history)

  • Support for Visual Studio 2010
    The CUDA Debugger works with either Visual Studio 2008 or Visual Studio 2010. Parallel Nsight includes build customizations for Visual Studio 2010 and for versions 3.1 and 3.2 of the CUDA Toolkit. Be aware that to use Parallel Nsight with Visual Studio 2010, you must have Visual Studio 2008 installed.
  • CUDA Toolkit 3.2 RC
    The CUDA Debugger supports projects built with CUDA Toolkit version 3.2 RC. The directory structure of the CUDA Toolkit created by Parallel Nsight is compatible with the directory structure of the CUDA Toolkit 3.1. Be aware that upgrading samples from Parallel Nsight 1.0 for use with the CUDA Toolkit 3.2 requires adjustments for build rules and project properties. See the User Guide for more information.
  • Tesla Compute Cluster (TCC)
    (Preview support) The CUDA Debugger can debug GPUs that use Tesla Compute Cluster (TCC) drivers in the R260 driver.
  • The CUDA Debugger supports 6GB GPUs, such as the Quadro 6000.
  • The CUDA Memory Checker now supports Fermi-based GPUs.
  • The CUDA Debugger supports debugging on the GF104-based GeForce GTX460 GPU.
  • The CUDA Debugger now issues warnings on stack underflow.
  • Support for 64-bit pointers and expressions. The debugger now displays 64-bit integers and pointers correctly. Previous releases incorrectly displayed values stored in memory that were 2^32 or greater.
  • CUDA Debug Focus
    There is a new setting called “Unconditional breakpoints follow focus” that controls whether unconditional breakpoints are restricted to the current focus. When the setting is disabled, the debugger stops in any warp that hits a breakpoint. When it is enabled, only the current focus, set through CUDA Debug Focus, hits breakpoints.
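
To illustrate the 64-bit display fix listed above: a value of 2^32 or more cannot survive a round trip through a 32-bit integer, which is the kind of truncation that leads to a wrong displayed value. The C++ sketch below only illustrates that arithmetic. The helper names shown_as_32bit, fits_in_32_bits, and demo_truncation are hypothetical and are not part of Parallel Nsight, and these release notes do not say how earlier releases actually computed the displayed value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: what a 64-bit value looks like if only its low
// 32 bits are kept (an assumed failure mode, for illustration only).
inline std::uint32_t shown_as_32bit(std::uint64_t value) {
    return static_cast<std::uint32_t>(value & 0xFFFFFFFFull);
}

// True when the value fits in 32 bits, so truncation would be harmless.
inline bool fits_in_32_bits(std::uint64_t value) {
    return value < (1ull << 32);
}

// Small self-check that can be called from main() or a unit test.
inline void demo_truncation() {
    const std::uint64_t big = (1ull << 32) + 5;  // first affected range
    assert(!fits_in_32_bits(big));
    assert(shown_as_32bit(big) == 5u);           // high bits are lost
    assert(fits_in_32_bits(0xFFFFFFFFull));      // largest unaffected value
}
```

Values below 2^32 pass through unchanged, which matches the note that only values of 2^32 or greater were affected.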

Changed Features and Fixed Issues In The 1.5 September 2010 RC Release Of The CUDA Debugger (history)

  • Release 1.5 has improved stability.
  • Many expression evaluation and symbolics fixes.
  • A variety of run control and stepping behaviors have been fixed. For example, local variables no longer show inaccurate values while stepping.
  • Fixed the error on Fermi when a breakpoint is hit in the first kernel launch.
  • Data breakpoints now work after the first kernel launch.
  • Double clicking on a stack frame in Visual Studio now takes you to the correct source location.
  • A target application no longer hangs after a channel error.
  • The values of float3 variables in the locals view are now correct.
  • The printf function in kernels no longer causes run control and expression evaluation failures.
  • LMem being reallocated with zero size no longer causes a crash.
  • Eigenvalues sample no longer shows incorrect floating point results in locals view.
  • Functions that have the __noinline__ keyword now show symbols in the locals view.
  • Local const variables are now accessible in locals view.
  • Memory Checker works on Fermi devices.
  • Eliminated false positives in the CUDA Memory Checker.
  • Duplicate entries no longer appear in the Visual Studio call stack.

Known Issues With CUDA Debugging In The 1.5 September 2010 RC Release (history)

  • Although Visual Studio 2010 is supported as a host IDE for all functionality in Parallel Nsight, CUDA C code must still be compiled using the Microsoft version 9.0 compilers (originally shipped with Visual Studio 2008). See the documentation for how to set up your CUDA project with Visual Studio 2010 and the 9.0 compilers.
  • TCC and abnormal termination: Do not kill a process that is executing code on a TCC device, except through the normal Stop Debugging command (SHIFT+F5) in Visual Studio. Abnormal termination of a debugging process on a TCC device results in unpredictable behavior. It causes future calls to cuCtxInit() to hang indefinitely, even though the killed process seems to terminate normally. The only way to recover is to reboot the target machine. This is an issue associated with the current release of the TCC driver.
  • When CUDA debugging on a remote machine with an attached display, a step sometimes takes a long time (2+ seconds) to occur. This does not happen in a headless configuration (an adapter with no attached displays). To avoid the delay, use on-board video for the display and dedicate the NVIDIA GPU to CUDA debugging.
  • CUDA Toolkit support
    CUDA Toolkit 2.3 is no longer compatible with the Parallel Nsight CUDA C debugger.
    CUDA Toolkit 3.0 is no longer supported with the Parallel Nsight CUDA C debugger.
  • If you use the CUDA 3.2 Toolkit and target a Fermi GPU, the debugger does not display the correct source line for the following statements:
       asm("brkpt;");
       __trap();
  • In certain cases where an application uses Direct3D/CUDA C interoperation, breakpoints might not be hit.
  • If you experience stability problems when debugging CUDA C on a GF100-based device (GTX470 or GTX480), disable the Freeze Warps setting:
    1. From the Nsight menu select Nsight Options.
    2. Click on CUDA.
    3. Set Stepper freezes non-focused warps to False.
  • Variables do not appear for source code that is not executed. This occurs because the compiler aggressively optimizes code even if you have not specified any compiler optimizations. As a result, the compiler removes any code that will not be executed from the output executable.
  • Breakpoints hit multiple times on lines that have more than one inline function call. For example, setting a breakpoint on:
        x = cos() + sin()
    generates three breakpoints on that line: one for the evaluation of the expression, plus one for each function on the line.
  • Visual Studio conditional breakpoints are not supported on program variables. That means that you cannot enter a single breakpoint to stop on "x==0, y==0".
  • In some situations, hitting a breakpoint prevents error detection. It is possible that errors can be masked in the following type of situation:
    - launch CUDA kernel with 2 blocks running on different SMs; each block has 1 warp
    - warp A hits a breakpoint while concurrently warp B does a write to an invalid address
    In this situation, the launch would succeed - but if run without breakpoints, a launch failure would occur.
    The above is just an example. It is possible that there are other ways for this to occur.
  • Unloading modules does not refresh the state of breakpoints set in that module. This means that those breakpoints do not show their latest state in Visual Studio when they have been unloaded.
  • The Visual Studio Breakpoint "Filter" option is not supported for CUDA GPU breakpoints.
  • The Visual Studio Breakpoint "Hitcount" option is not supported for CUDA GPU breakpoints.
  • The variables flattenedBlockIdx and flattenedThreadIdx do not work in the Watch window.
    (The variables threadIdx.x, blockIdx.x, and gridDim work.)
  • When the CUDA debugger pauses (program execution paused), if there are no CUDA contexts, Visual Studio shows the CUDA_SourceNotFound.txt page.
  • You cannot use the Visual Studio "Attach" function when using the CUDA debugger.
  • The CUDA Device Summary page can take a while to update. The default step speed is greater than 10 per second. If the device summary page is open, step speed can be reduced to less than 1 per second.
    Note: The time that it takes to update the CUDA Device Summary page is proportional to the number of executing blocks + warps on the machine. In other words, it's snappy if only 8 warps are present in the grid, but very slow if 900 warps are present.
  • You cannot set environment variables for a process launch.
  • The F5 hotkey (which is the default hotkey in Visual Studio for starting the CPU debugger) does not start the CUDA debugger.
    To start the CUDA debugger, you must either change the key bindings or use the menu command:
    Nsight > Start CUDA Debugging.
  • There is no support for automatically performing a Build when launching the debugger.
  • The Load Symbols option, or "Symbols settings", in the Modules view is not supported for CUDA debugging.
  • You cannot step over a printf() statement. Workaround: instead of stepping over the statement, set a breakpoint on the statement after the printf() statement. When debugging, if execution is paused on the printf() statement, resume execution (Run) instead of stepping over. Execution will continue until reaching the breakpoint.
  • In the CUDA Device Summary window, when Context is selected the fields Device Ordinal and GDI Device display incorrect information.
  • When application execution is paused on a line of code that contains a conditional statement (such as an IF or WHILE statement), if you Step Over the statement, the debugger does not pause at the next line of code after the conditional statement (after the closing curly brace). Instead, the debugger pauses execution on the last line of code in the branch that was executed. This happens because of the way the compiler emits line tables and treats curly braces. For more information on this unexpected behavior, see the article in the Parallel Nsight Knowledge Base titled "Run Control And Conditional Statements In CUDA Code."
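
The CUDA Device Summary note in the list above says that the update time is proportional to the number of executing blocks plus warps. As a rough way to estimate that number from a launch configuration, the C++ sketch below counts blocks and warps. It assumes a warp size of 32 threads and assumes that every block of the grid is executing at once, so the result is an upper bound; the real number depends on the GPU. The names kWarpSize, warps_per_block, and summary_load are hypothetical and are not part of Parallel Nsight.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kWarpSize = 32;  // assumption: 32 threads per warp

// Warps needed for one block; a partially filled warp still counts as one.
inline std::uint64_t warps_per_block(std::uint64_t threads_per_block) {
    assert(threads_per_block > 0);
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Upper bound on blocks + warps for a whole grid, the quantity the note
// says the Device Summary update time scales with.
inline std::uint64_t summary_load(std::uint64_t blocks,
                                  std::uint64_t threads_per_block) {
    return blocks + blocks * warps_per_block(threads_per_block);
}
```

For example, a single block of 256 threads gives 8 warps, while 100 blocks of 288 threads give 900 warps, which correspond to the fast and slow cases mentioned in the note.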

 

 

Graphics Inspector and Graphics Debugger

New In The 1.5 September 2010 RC Release (history)

  • Startup time has been improved for graphics debug sessions. The Graphics Inspector comes up much faster when launching a debug session.
  • New Direct3D11 DXGI texture formats are now supported and viewable.
  • Textures from the current draw call’s pixel shader are viewable directly on the Parallel Nsight HUD.
  • UAV textures are now visible as shader resource thumbnails.
  • Frame profiler now shows the index of draw/dispatch call in addition to its event index.
  • Resources are named after their D3DDebugObjectName if set.
  • Added Technique/Pass information to the breakpoint data for the Geometry Viewer.
  • Geometry Viewer now shows the full vertex buffer + paused vertices.
  • Parallel Nsight panels are disabled during shader debugging to prevent illegal GPU access.
  • Texture viewer now displays all visible DX11 formats.
  • Pixel values that appear in the Resource Page of the Graphics Inspector show the values in their underlying native formats, such as 32-bit floats (instead of translating those values to rendered RGBA 8-bit values).
  • HUD configuration signals can now be sorted by State and Name.
  • Support for DirectCompute debugging on all devices, including Fermi-based and Tesla-based devices. In addition, you can now use the Graphics Inspector to profile DirectCompute dispatches, which includes information on time spent in GPU processing and occupancy information.
  • Geometry shader, Hull shader, Domain shader, and Compute shader input parameters are now displayed correctly in the Locals and Watch windows.
  • Support for stepping into dynamically linked shader classes.

Fixed In This Release Of The Graphics Debugger (history)

  • Source lines are no longer skipped when stepping in a Vertex Shader.
  • Fixed problems with startup of shaders on a Tesla device.
  • Visible shader resources are now fully synced on both host and target.
  • Sorting the shader list using headers now works properly.
  • Profiler now handles zero draw call cases.
  • Focus changes now work when debugging geometry shaders.
  • Parallel Nsight no longer requires D3DX libraries.
  • InputLayouts with APPEND_ALIGNED_ELEMENT offsets now display their geometry properly.
  • Captures now consistently show the right number of events.
  • Host no longer hangs if stepping off the first or last event.
  • Target better handles running DX11 apps on DX10 and DX10.1 hardware.
  • Depth stencil resources now show up correctly when at breakpoint.
  • Multiple capture and resume sessions on the same target no longer hang the host.
  • Tooltips on event dependencies now show up.
  • Links between raw buffer views and their textures now lead to the correct pages.
  • Host scrubber now always synced with target HUD scrubbing.
  • Signals can now be reset/disabled in the HUD Configuration window.
  • Modifying semantic name strings previously used in the InputLayout is now allowed.
  • Pixel values in the Graphics Inspector no longer show clipped or normalized values.
  • Removed superfluous Process toolbar widget.
  • Profiling of dispatch calls now works.

Known Issues With The Graphics Debugger (history)

  • In Visual Studio 2010, closing the Frames page in the Graphics Inspector while the frame capture is paused causes Visual Studio to crash.
  • Replay-on-thread: Although fullscreen generally works, some applications may lose fullscreen status during the transition, and in windowed mode the replay window may occlude all other windows. If these issues become a problem, try changing the default behavior:
    1. When an application starts, press Ctrl-Z on the target to enable Nsight.
    2. Press T on the keyboard to toggle replay-on-thread.
  • In the Frame Profiler, the GPU Idle % column in the Draw Calls table is incorrectly always 0%.
  • The graphics debugger does not support the Reference Rasterizer (RefRast) tool, which is the CPU rasterizer provided by Microsoft. The graphics debugger will signal an error if the D3D10_CREATE_DEVICE_SWITCH_TO_REF constant is used in device creation.
  • You can set breakpoints in shaders only when performing remote debugging, not local debugging. Setting breakpoints when running the monitor process locally results in an unresponsive system.
  • To debug a shader that is written in HLSL, the source code needs to be compiled at runtime, typically when your application is starting up. You cannot debug a shader that is pre-compiled and loaded as a binary shader, because the graphics debugger cannot map the binary shader to the original HLSL source code. The graphics debugger currently supports the D3DXCompile and D3DXCompileShader compile functions (and their variants).
  • You cannot access or see some variables in the Watch window because of optimization performed by the HLSL compiler.
  • You cannot set breakpoints on certain lines of source code if the HLSL compiler optimizes the lines out.
  • The Graphics Focus Picker (Pixel tab) will sometimes show a garbled or blank image instead of properly rendering the current render target.
  • Forcing the target application to close through the task manager while in the frame debugger crashes the target application.
  • Expression evaluation and breakpoint conditions do not support HLSL built-in functions and vector and matrix expressions.
  • The following are limited in the graphics debugger:
    • Visualization of Depth-Stencil formats. We show the depth part of a DS format, but not the stencil.
    • Examination of integer-based textures (DXGI formats that end in SINT or UINT).
  • Before using the frame profiler, you must remove all breakpoints.
  • You cannot place a conditional statement on the same line as a discard statement. For example, executing the following statement with the graphics debugger:

       if (true) discard;

    results in the graphics debugger becoming unresponsive. You must write the source code so that there is a carriage return immediately before the discard statement.
  • When debugging pixel shaders on a Fermi-based device, the target system can become unresponsive.

 

 

Analysis Tools

New In The 1.5 September 2010 RC Release (history)

  • NVIDIA Tools Extension library events have been improved with color, payload, and resource naming functions.
  • The Analyzer supports CUDA C/C++ trace and profiling on GeForce GTX460 GPUs.
  • GPU-side workloads from Direct3D and OpenGL draw calls can now be traced.
  • Support for GeForce GTX460 GPUs.
  • Ability to reconfigure an activity during a running session:
    • Support for switching between trace and CUDA kernel profiling.
    • Support for reconfiguring the trace options.
    • Support for changing selected performance counters.
  • Extended trace support to include:
    • CUDA 3.2 Driver API calls, kernel launches and memory copies.
    • GPU-side workload from Direct3D and OpenGL including frame switches, draw calls and dispatches.
  • Added new performance counters and derived statistics to the CUDA kernel profile activity.
  • Enhancements to the NVIDIA Tools Extension library:
    • Events can be categorized and colored.
    • Resources can be named to make traces more meaningful. Nameable resources include:
      • System threads.
      • CUDA and OpenCL objects.

 

Fixed In This Release Of The Analysis Tools (history)

Analysis Report Issues Fixed

  • View settings of report tables, such as column order, column visibility, and sort order, are now persisted.
  • Filtering is now supported for all table columns.
  • Fixed possible crash when working with the Column Chooser.
  • The performance counter columns of the CUDA Launches table are now labeled consistently with the activity page. These columns now also have tooltips.

Timeline Report Issues Fixed

  • Fixed a tooltip flickering issue reported on some systems.
  • Fixed possible crash when working with the filter UI.
  • Further improvements to the rendering performance of the timeline.
  • Added new shortcuts to improve the timeline’s usability. Refer to the new help dialog for more information. To open the help, click the document icon in the top right-hand corner of the navigation bar and select "Show Help".

 

Known Issues In The Analysis Tools (history)

  • Do not start an analysis capture session when the CUDA debugger is paused on a breakpoint. Doing so can cause the system to crash.
  • Toggling the state of the Windows Aero desktop while an analysis report table is open and visible on the screen can cause Visual Studio to crash.

Analysis Activity Known Issues

  • It is only possible to configure the counters for a profile session for a single hardware architecture, such as GT200 or GF100. However, the counter selection of a profile analysis session is currently applied to all GPUs - independent of their architecture. Running a profile session for an application that uses multiple, different device architectures may fail to collect counters on one of the used devices or may even fail completely.
  • Capturing data from a 64-bit process launched from a 32-bit process is not supported.
  • Capturing data from managed processes is not supported.
  • The stop collection timer is implemented in Visual Studio. The latency to communicate to the monitor and application can result in a longer duration than requested.
  • CPU Thread Trace
    If the Windows Kernel Event Provider is already in use when a new capture session is launched, the collected data may produce unexpected results. For best results ensure that no other kernel providers are running during an analysis session.
  • CUDA Trace
    • The Fermi architecture supports concurrent kernels. However, the trace tool does not support the tracing of concurrent kernels. For that reason, all kernels are forcefully serialized.
    • CUDA trace does not trace graphics interop or cuMemset commands.
    • When used with CUDA C Runtime API programs, CUDA Trace does not capture the Runtime API calls or the <<< >>> kernel launch syntax. Instead, the corresponding CUDA C Driver API calls are reported. Some of the CUDA C Driver API calls that are executed as part of the CUDA Runtime calls may report errors, such as CUDA_ERROR_INVALID_CONTEXT, even though the usage of the CUDA Runtime API is valid.
    • When collecting trace information about CUDA kernels and memory transfers, the report file sometimes does not contain information about the launches and memory copies. If an analysis report does not contain this information, it is likely that the collection time was not sufficient. This can happen because only certain events trigger the collection of information about CUDA launches and memory copies. The triggering events include:
      • 256 launches or 256 memory copies outstanding
      • a call to cuCtxSynchronize()/cudaThreadSynchronize()
      • a call to cuCtxDestroy()/cudaThreadExit()
      • an internal synchronization
      Remedy: On the Summary Report page, if either the number of Launches or the number of Memory Copies is less than 256, increase the collection time or add a call to cuCtxSynchronize() to your code. We recommend removing the call from the release version of your build.
  • CUDA Profiler
    • On Tesla GPUs, branch counters include __syncthreads().
    • The CUDA Launches counter column settings are not persisted.
    • Profile Trigger increments by 1 per warp, not by 1 per active thread.
    • The Fermi CUDA profiler currently has limitations on which counters can be collected at the same time.
    • The CUDA profiler only supports collecting counters from one context per device.
  • OpenCL
    • The end timestamp can sometimes be recorded significantly after the completion of a command. If this occurs, adding a clFlush call after the specific command fixes the timestamp.
    • The start/end range for memory read and write commands includes both host and device time. CUDA start/end range only includes device time.
    • Viewing OpenCL Source or Binary code from the OpenCL Programming Builds or OpenCL Program Summary creates a temporary file in %TMP%. The temporary file is not deleted when the file is closed.
    • OpenCL reports occasionally do not contain device commands. This can occur if the OpenCL context/queue is not released or if fewer than 512 events occurred during a capture.
  • DirectX/OpenGL Trace
    • Graphics workload information, such as draw calls and dispatches, is output in groups of 1024 workload events. As a consequence, a report will not contain any graphics workload information if an insufficient number of draw calls occurred during a capture. Increasing the capture duration helps to work around this limitation.
  • In a Detailed Report, filtering table rows does not support filtering strings containing a comma (,). This primarily affects users attempting to filter launch dimensions or strings in Tools Extensions Events, Performance Markers, and compilation output.
  • Running an analysis session with OpenGL API trace on and manually closing the target application (by hitting the x button) will sometimes crash the target application.
  • Using the "After Specified Time..." field to control capture time does not result in exact capture times. This is due to small, unpredictable latency in communication time between the analysis tools, the Parallel Nsight monitor, and the target application.
  • Parallel Nsight does not automatically support capturing trace data of batch files. However, you can collect data from batch files by defining an Activity with the following settings, where MyBatchFile.bat is the name of the batch file you want to analyze:
    • Activity Type: System Trace
    • Application: C:\Windows\System32\cmd.exe
    • Arguments: /C MyBatchFile.bat
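
The CUDA Trace remedy in the list above suggests adding a cuCtxSynchronize() call so that launch and copy records are collected, and leaving that call out of release builds. One way to arrange this is a compile-time switch, sketched below in C++. The names NSIGHT_TRACE_BUILD, TRACE_FLUSH, flush_for_trace, g_flush_calls, and run_batch are all hypothetical. flush_for_trace is a stand-in that only counts calls so the sketch builds without the CUDA Toolkit; in a real project its body would call cuCtxSynchronize().

```cpp
#include <cassert>

// Hypothetical build switch: set to 1 for builds you capture with CUDA Trace,
// and to 0 for release builds.
#ifndef NSIGHT_TRACE_BUILD
#define NSIGHT_TRACE_BUILD 1
#endif

static int g_flush_calls = 0;  // counts flushes; exists only for this sketch

// Stand-in for the real synchronization call. In a CUDA project the body
// would call cuCtxSynchronize(); here it only counts invocations.
inline void flush_for_trace() { ++g_flush_calls; }

#if NSIGHT_TRACE_BUILD
#define TRACE_FLUSH() flush_for_trace()
#else
#define TRACE_FLUSH() ((void)0)  // expands to nothing in release builds
#endif

// Placeholder for a batch of kernel launches followed by one flush.
inline int run_batch(int launches) {
    assert(launches >= 0);
    int issued = 0;
    for (int i = 0; i < launches; ++i) {
        ++issued;  // a kernel launch would go here
    }
    TRACE_FLUSH();  // one synchronization point per batch in trace builds
    return issued;
}
```

With the switch set to 0, TRACE_FLUSH() compiles to nothing, so the synchronization call does not reach the release build.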

Analysis Report Known Issues

  • Toggling Windows Aero Color Scheme while an Analysis Report is open can cause Visual Studio to crash.
  • Changes made to the performance counter columns on the CUDA Launches settings do not persist. That is, these columns always appear in their default position when the page is opened.
  • If 2 different host computers use the same remote target machine, it is possible that the 2 machines could generate the same report directory. This would be confusing because reports from the 2 machines would be mixed together. Although unlikely, this can occur when 2 different machines analyze an application of the same name. The Parallel Nsight analysis tools on the host machine create the directory name based on the name of the application.
  • In a detailed report, the filtering of table rows does not support strings that have a comma (,). This primarily affects users attempting to filter launch dimensions or strings in Tools Extensions Events, Performance Markers, and compilation output.
  • Viewing OpenCL Source or Binary code from the OpenCL Programming Builds or OpenCL Program Summary creates a temporary file in %TMP%. The temporary file is not deleted when the file is closed.

Timeline Known Issues

  • There can be an error of approximately 1 microsecond between CPU events and GPU events.
  • Percentages displayed in the row labels and tool tips are based upon the full capture time.
  • The mouse forward and back buttons cannot be used to navigate the report page system.
  • CTRL+- toggles to the previous document instead of Zooming Out.
  • CTRL + SHIFT + 0 for resetting a row’s height to the default might not work on Vista systems. For more information, see http://support.microsoft.com/kb/967893.
  • Double-clicking on a row containing a line/area graph that also has children will expand/collapse the row as opposed to increasing the height to 66% of the view.
  • Using VNC (virtual network computing) software to remotely open a Timeline Report can cause Visual Studio to crash.

 

 

 

Release Notes Rev.1.5.100924