usage(){ cat << EOU
cxs_min_igs.sh : Standalone pure Opticks Optical Propagation using a sequence of 1000 input genstep files
==========================================================================================================

This script was used to hunt down the cause of a VRAM leak that increased
memory usage by ~14 kB every launch. See ~/opticks/notes/issues/okjob_GPU_memory_leak.rst

::

    ~/o/CSGOptiX/cxs_min_igs.sh

* skipping the launch : don't see the leak, GPU memory stays at 1283 MiB
* with the launch : clear continuous growth from 1283 MiB across 1000 events
* skipping only the gather (not the launch) : still leaks the same
* OPTICKS_MAX_BOUNCE=0, i.e. generate only : seems to make no difference to the leak
* making the launch do "nothing" with rg_dummy : curiously increases the leak by 1.7x
* scale with MAX_PHOTONS ? NO, changing MAX_PHOTONS makes no difference to the size of the leak
* try the default stream for the launch ? perhaps then only one stream is used for all launches

* BINGO : USING THE DEFAULT STREAM FOR ALL optixLaunch CALLS AVOIDS THE LEAK


Forum hunt
------------

* https://forums.developer.nvidia.com/search?q=optix%20GPU%20memory%20leak%20
* https://forums.developer.nvidia.com/search?q=optix%20memory%20
* https://forums.developer.nvidia.com/t/cudastreamcreate-calls-in-optixhello-and-optixtriangle-samples/239688


BINGO : Forum message that enabled finding the ~14 kB/launch VRAM leak
------------------------------------------------------------------------

* https://forums.developer.nvidia.com/t/cudastreamcreate-calls-in-optixhello-and-optixtriangle-samples/239688

CSGOptiX.cc::

    if(DEBUG_SKIP_LAUNCH == false)
    {
        CUdeviceptr d_param = (CUdeviceptr)Params::d_param ;
        assert( d_param && "must alloc and upload params before launch");

        /*
        // this way leaks ~14 kB for every launch
        CUstream stream ;
        CUDA_CHECK( cudaStreamCreate( &stream ) );
        OPTIX_CHECK( optixLaunch( pip->pipeline, stream, d_param, sizeof( Params ), &(sbt->sbt), width, height, depth ) );
        */

        // Using the default stream seems to avoid the ~14 kB VRAM leak at every launch.
        // Does that mean every launch gets to use the same single default stream ?
        CUstream stream = 0 ;
        OPTIX_CHECK( optixLaunch( pip->pipeline, stream, d_param, sizeof( Params ), &(sbt->sbt), width, height, depth ) );

        CUDA_SYNC_CHECK();
        // see CSG/CUDA_CHECK.h : CUDA_SYNC_CHECK does cudaDeviceSynchronize
        // THIS LIKELY HAS LARGE PERFORMANCE IMPLICATIONS : BUT NOT EASY TO AVOID (MULTI-BUFFERING ETC..)
    }


* https://forums.developer.nvidia.com/t/cudastreamcreate-calls-in-optixhello-and-optixtriangle-samples/239688

kebiro
Jan 16 '23

Hi,

I have a small suggestion to make.

The above two samples issue a call to cudaStreamCreate() just before the call
to optixLaunch().

Here is where I see the problem: these two samples in particular are likely to
act as a starting point for a lot of people, so I imagine that the call to
cudaStreamCreate() can easily creep into the main loop, potentially causing a
memory leak if it’s not destroyed. At the very least this happened to me.

So I’d suggest to either add a call to cudaStreamDestroy() just before exiting
the “launch” block, or just use nullptr as the stream argument for
optixLaunch().

PS: on a side note I have another question. When running Nsight Compute via
CLI, it will print context and stream information (e.g. >kernel<, >date<,
Context 1, Stream 13). Where can I find this information in the GUI?


dhart
Moderator
Jan 16 '23

Hi @kebiro, welcome!

Thanks for the suggestion, we’ll take it under advisement. We do want to
explicitly model using CUDA streams and encourage people to know how to work
with streams as a best practice, but it’s a good point that it’s not being
explicitly cleaned up in the samples. At the very least maybe we can add a
comment.

For the Nsight Compute question, I don’t know about context and stream IDs, but
date and kernel and lots of other stats are available on the Session and
Details pages, and the kernel to inspect is available in the “Result” drop-down
near the top left, next to the “Page” drop down.

–
David.



Usage
------

(1) Start eyeballing/recording VRAM usage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


1. Eyeballing nvidia-smi in another terminal::

       nvidia-smi -lms 500

2. NVML memory recording with smonitor.sh::

       ~/o/sysrap/smonitor.sh


(2) Run the pure Opticks simulation using input gensteps (1000 files)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A large-ish number of events is needed so the run lasts long enough (~10 seconds)
to make a recording::

    ~/o/CSGOptiX/cxs_min_igs.sh


(3) Stop the smonitor recording
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ctrl-C the smonitor process : it writes the smonitor.npy array and exits.


(4) Grab the recording and analyse on laptop
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    ~/o/sysrap/smonitor.sh grab

    START=10 ~/o/sysrap/smonitor.sh ana

    PUB=rg_default_stream START=10 ~/o/sysrap/smonitor.sh ana
    PUB=rg_default_stream START=10 ~/o/sysrap/smonitor.sh mpcap
    PUB=rg_default_stream START=10 ~/o/sysrap/smonitor.sh mppub


Variations to pin down the leak
----------------------------------

+------------+---------------------------------+--------------------------------------------------------+
| variation  | (NVML) GPU memory increase      | notes                                                  |
|            | across 1000 launches, sampled   |                                                        |
|            | every 0.1 s                     |                                                        |
+============+=================================+========================================================+
| rg         | 0.014 GB/s                      |                                                        |
+------------+---------------------------------+--------------------------------------------------------+
| rg         | 0.016, 0.015 GB/s               | after starting to add eoj cleanup (no change expected) |
+------------+---------------------------------+--------------------------------------------------------+
| rg         | 0.014 GB/s                      | OPTICKS_MAX_PHOTON M1->k10 makes no difference         |
+------------+---------------------------------+--------------------------------------------------------+
| rg         | 0.015 GB/s                      | OPTICKS_MAX_PHOTON M1->M10 makes no difference         |
|            |                                 | other than longer initialization time                  |
+------------+---------------------------------+--------------------------------------------------------+
| rg         | 0.000 GB/s                      | after adopting the default CUDA stream for optixLaunch |
+------------+---------------------------------+--------------------------------------------------------+
| rg_dummy   | 0.025, 0.025 GB/s               | 1.7x bigger leak with do-nothing RG ?                  |
+------------+---------------------------------+--------------------------------------------------------+


Question : How to reduce the GPU memory increment at each launch ?
--------------------------------------------------------------------

All the geometry+pipeline setup is done once at initialization,
so presumably the primary memory activity at each launch
is arranging the stack for all the threads.

GB and byte differences::

    dgb     0.125          (usedGpuMemory_GB[sel][-1]-usedGpuMemory_GB[sel][0])
    db      124780544.000  (usedGpuMemory[sel][-1]-usedGpuMemory[sel][0])

    dt      8.760          (t[sel][-1]-t[sel][0])

    dgb/dt  0.014
    db/dt   14244616.140

Dividing by approx 1000 launches gives ~14 kB per launch::

    In [1]: 14244616.140/1000
    Out[1]: 14244.61614
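
The rate arithmetic above can be sketched with numpy. This is a hedged illustration only: the real layout of the smonitor.npy recording analysed by smonitor.sh may differ, so here the recording is modelled simply as a timestamp array plus an NVML usedGpuMemory byte array, filled with synthetic data matching the measured rate.

```python
import numpy as np

# Illustrative sketch : model the smonitor recording as a timestamp array t
# (seconds) and an NVML usedGpuMemory array (bytes). The actual smonitor.npy
# layout may differ from this assumption.
def leak_rate(t, used_bytes, start=10):
    sel = slice(start, None)            # skip initialization samples, like START=10
    tw, uw = t[sel], used_bytes[sel]
    db = uw[-1] - uw[0]                 # byte increase across the selected window
    dt = tw[-1] - tw[0]                 # window duration in seconds
    return db/dt, (db/1e9)/dt           # leak rate in bytes/s and GB/s

# synthetic recording : 0.1 s sampling, linear ~14.2 MB/s leak on top of 1283 MiB
t = np.arange(0.0, 10.0, 0.1)
used = 1283*1024**2 + 14244616.140*t
bps, gbps = leak_rate(t, used)
per_launch = bps/1000.0                 # divide by approx 1000 launches, as above
print(bps, gbps, per_launch)
```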
EOU
}


SDIR=$(dirname $(realpath $BASH_SOURCE))
TEST=input_genstep $SDIR/cxs_min.sh