#!/bin/bash -l
usage(){ cat << EOU
cxs_min_igs.sh : Standalone pure Opticks Optical Propagation using a sequence of 1000 input genstep files
==========================================================================================================

This was used to hunt down the cause of a VRAM memory leak that incremented
memory usage by ~14 kB every launch. See ~/opticks/notes/issues/okjob_GPU_memory_leak.rst

::

   ~/o/CSGOptiX/cxs_min_igs.sh

* skipping the launch, don't see the leak : GPU mem stays at 1283 MiB
* with the launch, clear continuous growth from 1283 MiB across 1000 evt
* skipping the gather only (not the launch) still leaks the same
* OPTICKS_MAX_BOUNCE=0 ie only generate, seems to make no difference to the leak
* making the launch do "nothing" with rg_dummy : curiously increases the leak by 1.7x
* scale with MAX_PHOTONS ? NO, changing MAX_PHOTONS makes no difference to the size of the leak
* try the default stream for the launch ? perhaps then only one stream for all launches ?

  * BINGO : USING THE DEFAULT STREAM FOR ALL optixLaunch AVOIDS THE LEAK


Forum hunt
------------

* https://forums.developer.nvidia.com/search?q=optix%20GPU%20memory%20leak%20
* https://forums.developer.nvidia.com/search?q=optix%20memory%20
* https://forums.developer.nvidia.com/t/cudastreamcreate-calls-in-optixhello-and-optixtriangle-samples/239688


BINGO : Forum message that enabled finding the ~14 kB/launch VRAM leak
-----------------------------------------------------------------------

* https://forums.developer.nvidia.com/t/cudastreamcreate-calls-in-optixhello-and-optixtriangle-samples/239688

CSGOptiX.cc::

    1031     if(DEBUG_SKIP_LAUNCH == false)
    1032     {
    1033         CUdeviceptr d_param = (CUdeviceptr)Params::d_param ;
    1034         assert( d_param && "must alloc and upload params before launch");
    1035 
    1036         /*
    1037         // this way leaks ~14 kB for every launch
    1038         CUstream stream ;
    1039         CUDA_CHECK( cudaStreamCreate( &stream ) );
    1040         OPTIX_CHECK( optixLaunch( pip->pipeline, stream, d_param, sizeof( Params ), &(sbt->sbt), width, height, depth ) );
    1041         */
    1042 
    1043         // Using the default stream avoids the ~14 kB VRAM leak at every launch.
    1044         // Does that mean every launch gets to use the same single default stream ?
    1045         CUstream stream = 0 ;
    1046         OPTIX_CHECK( optixLaunch( pip->pipeline, stream, d_param, sizeof( Params ), &(sbt->sbt), width, height, depth ) );
    1047 
    1048         CUDA_SYNC_CHECK();
    1049         // see CSG/CUDA_CHECK.h : CUDA_SYNC_CHECK does cudaDeviceSynchronize
    1050         // THIS LIKELY HAS LARGE PERFORMANCE IMPLICATIONS : BUT NOT EASY TO AVOID (MULTI-BUFFERING ETC..)
    1051     }


* https://forums.developer.nvidia.com/t/cudastreamcreate-calls-in-optixhello-and-optixtriangle-samples/239688

kebiro
Jan 16 '23

Hi,

I have a small suggestion to make.

The above two samples issue a call to cudaStreamCreate() just before the call
to optixLaunch().

Here is where I see the problem: these two samples in particular are likely to
act as a starting point for a lot of people, so I imagine that the call to
cudaStreamCreate() can easily creep into the main loop, potentially causing a
memory leak if it’s not destroyed. At the very least this happened to me.

So I’d suggest to either add a call to cudaStreamDestroy() just before exiting
the “launch” block, or just use nullptr as the stream argument for
optixLaunch().

PS: on a side note I have another question. When running Nsight Compute via
CLI, it will print context and stream information (e.g. kernel, date,
Context 1, Stream 13). Where can I find this information in the GUI?

dhart
Moderator
Jan 16 '23

Hi @kebiro, welcome!

Thanks for the suggestion, we’ll take it under advisement. We do want to
explicitly model using CUDA streams and encourage people to know how to work
with streams as a best practice, but it’s a good point that it’s not being
explicitly cleaned up in the samples. At the very least maybe we can add a
comment.

For the Nsight Compute question, I don’t know about context and stream IDs, but
date and kernel and lots of other stats are available on the Session and
Details pages, and the kernel to inspect is available in the “Result” drop-down
near the top left, next to the “Page” drop-down.

David.



Usage
------

(1) Start eyeballing/recording VRAM usage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. Eyeballing nvidia-smi in another terminal::

    nvidia-smi -lms 500

2. NVML memory recording with smonitor.sh::

    ~/o/sysrap/smonitor.sh       # workstation: ctrl-C after the GPU process in the other terminal completes


(2) Run the pure Opticks simulation using input gensteps (1000 files)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A large-ish number of events is needed so that the run lasts long enough
(~10 seconds) to make a recording::

   ~/o/CSGOptiX/cxs_min_igs.sh


(3) Stop the smonitor recording
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On ctrl-C the smonitor process writes the smonitor.npy array and exits.


(4) Grab the recording and analyse on laptop
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    ~/o/sysrap/smonitor.sh grab  # laptop

    START=10 ~/o/sysrap/smonitor.sh ana   # START specifies where to measure the gradient from

    PUB=rg_default_stream START=10 ~/o/sysrap/smonitor.sh ana
    PUB=rg_default_stream START=10 ~/o/sysrap/smonitor.sh mpcap
    PUB=rg_default_stream START=10 ~/o/sysrap/smonitor.sh mppub

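The gradient reported by the ana step is essentially the slope of GPU memory
against time, measured from the START sample onward. The following is an
illustration only, not the smonitor.sh internals: synthetic (seconds, bytes)
samples are hard-coded as awk variables, and the slope is taken the same way::

    # Illustration only : estimate a GB/s gradient from (seconds, bytes)
    # samples, skipping the first "start" samples as the START envvar does.
    # The sample values here are synthetic, not a real recording.
    awk -v start=2 'BEGIN {
        n = 5
        split("0.0 0.1 0.2 0.3 0.4", t, " ")
        split("1283000000 1283000000 1284400000 1285800000 1287200000", b, " ")
        dt = t[n] - t[start+1]
        db = b[n] - b[start+1]
        printf "dgb/dt %.3f GB/s\n", (db/1e9)/dt
    }'

With these synthetic values it prints ``dgb/dt 0.014 GB/s``, the same order as
the rg rows in the table below.
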
Variations to pin down the leak
----------------------------------


+------------+---------------------------------+--------------------------------------------------------+
| variation  | (NVML) GPU memory increase      | notes                                                  |
|            | across 1000 launches, measured  |                                                        |
|            | every 0.1 s                     |                                                        |
+============+=================================+========================================================+
| rg         |        0.014 GB/s               |                                                        |
+------------+---------------------------------+--------------------------------------------------------+
| rg         |        0.016, 0.015 GB/s        | after starting to add eoj cleanup (no change expected) |
+------------+---------------------------------+--------------------------------------------------------+
| rg         |        0.014 GB/s               | OPTICKS_MAX_PHOTON M1->k10 makes no difference         |
+------------+---------------------------------+--------------------------------------------------------+
| rg         |        0.015 GB/s               | OPTICKS_MAX_PHOTON M1->M10 makes no difference         |
|            |                                 | other than longer initialization time                  |
+------------+---------------------------------+--------------------------------------------------------+
| rg         |        0.000 GB/s               | After adopting the default CUDA stream for optixLaunch |
+------------+---------------------------------+--------------------------------------------------------+
| rg_dummy   |        0.025, 0.025 GB/s        | 1.7x bigger leak with do-nothing RG ?                  |
+------------+---------------------------------+--------------------------------------------------------+


Question : How to reduce the GPU memory increment at each launch ?
------------------------------------------------------------------

All the geometry+pipeline setup is done once at initialization,
so presumably the primary memory activity at each launch
is arranging the stack for all the threads.

GB and byte differences::

    dgb      0.125  (usedGpuMemory_GB[sel][-1]-usedGpuMemory_GB[sel][0])
    db  124780544.000  (usedGpuMemory[sel][-1]-usedGpuMemory[sel][0])

    dt        8.760  (t[sel][-1]-t[sel][0])

    dgb/dt       0.014
    db/dt   14244616.140


Dividing by the approx 1000 launches gives ~14 kB per launch::

    In [1]: 14244616.140/1000
    Out[1]: 14244.61614

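These figures can be re-derived in one go from the deltas above with a small
awk sketch (dt here is the rounded 8.760 s, so the final digits differ
slightly from the smonitor output)::

    # Re-derive the leak rate and the ~14 kB/launch estimate from the
    # deltas above. dt is rounded, so final digits differ slightly.
    awk 'BEGIN {
        db = 124780544        # byte increase over the selected window
        dt = 8.760            # seconds spanned by the window
        printf "db/dt      %.1f bytes/s\n", db/dt
        printf "per launch %.1f bytes\n", (db/dt)/1000
    }'
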

EOU
}


#export OPTIX_FORCE_DEPRECATED_LAUNCHER=1  ## seems no difference re leak
#export OPTICKS_NUM_EVENT=3   # reduce from the default of 1000 for shakedown
#export OPTICKS_EVENT_MODE=DebugLite
#export QSim__simulate_DEBUG_SKIP_LAUNCH=1
#export QSim__simulate_DEBUG_SKIP_GATHER=1
#export CSGOptiX__launch_DEBUG_SKIP_LAUNCH=1

#export OPTICKS_MAX_BOUNCE=0  ## seems no difference re leak
#export OPTICKS_MAX_PHOTON=M10  ## change from the default of 1M


#export PIP__createRaygenPG_DUMMY=1   # replace __raygen__rg with do-nothing __raygen__rg_dummy
#export PIP=INFO
#export LIFECYCLE=1

#export SEvt__MINIMAL=1


SDIR=$(dirname $(realpath $BASH_SOURCE))
TEST=input_genstep $SDIR/cxs_min.sh