usage(){ cat << EOU
cxs_min_igs.sh : Standalone pure Opticks Optical Propagation using a sequence of 1000 input genstep files
==========================================================================================================

This script was used to hunt down the cause of a VRAM leak that increased
memory usage by ~14 kB every launch. See ~/opticks/notes/issues/okjob_GPU_memory_leak.rst

::

    ~/o/CSGOptiX/cxs_min_igs.sh

* skipping the launch : don't see the leak, GPU memory stays at 1283 MiB
* with the launch : clear continuous growth from 1283 MiB across 1000 events
* skipping only the gather (not the launch) : still leaks the same
* OPTICKS_MAX_BOUNCE=0, i.e. generate only : seems to make no difference to the leak
* making the launch do "nothing" with rg_dummy : curiously increases the leak by 1.7x
* scale with MAX_PHOTONS ? NO, changing MAX_PHOTONS makes no difference to the size of the leak
* try the default stream for the launch ? perhaps then only one stream is used for all launches

* BINGO : USING THE DEFAULT STREAM FOR ALL optixLaunch CALLS AVOIDS THE LEAK


Forum hunt
------------

* https://forums.developer.nvidia.com/search?q=optix%20GPU%20memory%20leak%20
* https://forums.developer.nvidia.com/search?q=optix%20memory%20
* https://forums.developer.nvidia.com/t/cudastreamcreate-calls-in-optixhello-and-optixtriangle-samples/239688


BINGO : Forum message that enabled finding the ~14 kB/launch VRAM leak
------------------------------------------------------------------------

* https://forums.developer.nvidia.com/t/cudastreamcreate-calls-in-optixhello-and-optixtriangle-samples/239688

CSGOptiX.cc::

    if(DEBUG_SKIP_LAUNCH == false)
    {
        CUdeviceptr d_param = (CUdeviceptr)Params::d_param ;
        assert( d_param && "must alloc and upload params before launch");

        /*
        // this way leaks ~14 kB for every launch
        CUstream stream ;
        CUDA_CHECK( cudaStreamCreate( &stream ) );
        OPTIX_CHECK( optixLaunch( pip->pipeline, stream, d_param, sizeof( Params ), &(sbt->sbt), width, height, depth ) );
        */

        // Using the default stream seems to avoid the ~14 kB VRAM leak at every launch.
        // Does that mean every launch gets to use the same single default stream ?
        CUstream stream = 0 ;
        OPTIX_CHECK( optixLaunch( pip->pipeline, stream, d_param, sizeof( Params ), &(sbt->sbt), width, height, depth ) );

        CUDA_SYNC_CHECK();
        // see CSG/CUDA_CHECK.h : CUDA_SYNC_CHECK does cudaDeviceSynchronize
        // THIS LIKELY HAS LARGE PERFORMANCE IMPLICATIONS : BUT NOT EASY TO AVOID (MULTI-BUFFERING ETC..)
    }


* https://forums.developer.nvidia.com/t/cudastreamcreate-calls-in-optixhello-and-optixtriangle-samples/239688

kebiro
Jan 16 '23

Hi,

I have a small suggestion to make.

The above two samples issue a call to cudaStreamCreate() just before the call
to optixLaunch().

Here is where I see the problem: these two samples in particular are likely to
act as a starting point for a lot of people, so I imagine that the call to
cudaStreamCreate() can easily creep into the main loop, potentially causing a
memory leak if it’s not destroyed. At the very least this happened to me.

So I’d suggest to either add a call to cudaStreamDestroy() just before exiting
the “launch” block, or just use nullptr as the stream argument for
optixLaunch().

PS: on a side note I have another question. When running Nsight Compute via
CLI, it will print context and stream information (e.g. >kernel<, >date<,
Context 1, Stream 13). Where can I find this information in the GUI?


dhart
Moderator
Jan 16 '23

Hi @kebiro, welcome!

Thanks for the suggestion, we’ll take it under advisement. We do want to
explicitly model using CUDA streams and encourage people to know how to work
with streams as a best practice, but it’s a good point that it’s not being
explicitly cleaned up in the samples. At the very least maybe we can add a
comment.

For the Nsight Compute question, I don’t know about context and stream IDs, but
date and kernel and lots of other stats are available on the Session and
Details pages, and the kernel to inspect is available in the “Result” drop-down
near the top left, next to the “Page” drop down.

–
David.



Usage
------

(1) Start eyeballing/recording VRAM usage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


1. Eyeballing nvidia-smi in another terminal::

       nvidia-smi -lms 500

2. NVML memory recording with smonitor.sh::

       ~/o/sysrap/smonitor.sh


(2) Run the pure Opticks simulation using input gensteps (1000 files)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A large-ish number of events is needed so the run lasts long enough (~10 seconds)
to make a recording::

    ~/o/CSGOptiX/cxs_min_igs.sh


(3) Stop the smonitor recording
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ctrl-C the smonitor process : it writes the smonitor.npy array and exits.


(4) Grab the recording and analyse on laptop
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

    ~/o/sysrap/smonitor.sh grab

    START=10 ~/o/sysrap/smonitor.sh ana

    PUB=rg_default_stream START=10 ~/o/sysrap/smonitor.sh ana
    PUB=rg_default_stream START=10 ~/o/sysrap/smonitor.sh mpcap
    PUB=rg_default_stream START=10 ~/o/sysrap/smonitor.sh mppub


Variations to pin down the leak
----------------------------------

+------------+---------------------------------+--------------------------------------------------------+
| variation  | (NVML) GPU memory increase      | notes                                                  |
|            | across 1000 launches, sampled   |                                                        |
|            | every 0.1 s                     |                                                        |
+============+=================================+========================================================+
| rg         | 0.014 GB/s                      |                                                        |
+------------+---------------------------------+--------------------------------------------------------+
| rg         | 0.016, 0.015 GB/s               | after starting to add eoj cleanup (no change expected) |
+------------+---------------------------------+--------------------------------------------------------+
| rg         | 0.014 GB/s                      | OPTICKS_MAX_PHOTON M1->k10 makes no difference         |
+------------+---------------------------------+--------------------------------------------------------+
| rg         | 0.015 GB/s                      | OPTICKS_MAX_PHOTON M1->M10 makes no difference         |
|            |                                 | other than longer initialization time                  |
+------------+---------------------------------+--------------------------------------------------------+
| rg         | 0.000 GB/s                      | after adopting the default CUDA stream for optixLaunch |
+------------+---------------------------------+--------------------------------------------------------+
| rg_dummy   | 0.025, 0.025 GB/s               | 1.7x bigger leak with do-nothing RG ?                  |
+------------+---------------------------------+--------------------------------------------------------+


Question : How to reduce the GPU memory increment at each launch ?
--------------------------------------------------------------------

All the geometry+pipeline setup is done once at initialization,
so presumably the primary memory activity at each launch
is arranging the stack for all the threads.

GB and byte differences::

    dgb     0.125          (usedGpuMemory_GB[sel][-1]-usedGpuMemory_GB[sel][0])
    db      124780544.000  (usedGpuMemory[sel][-1]-usedGpuMemory[sel][0])

    dt      8.760          (t[sel][-1]-t[sel][0])

    dgb/dt  0.014
    db/dt   14244616.140

Dividing by approx 1000 launches gives ~14 kB per launch::

    In [1]: 14244616.140/1000
    Out[1]: 14244.61614
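
The rate arithmetic above can be sketched with numpy. This is a hedged illustration only: the real layout of the smonitor.npy recording analysed by smonitor.sh may differ, so here the recording is modelled simply as a timestamp array plus an NVML usedGpuMemory byte array, filled with synthetic data matching the measured rate.

```python
import numpy as np

# Illustrative sketch : model the smonitor recording as a timestamp array t
# (seconds) and an NVML usedGpuMemory array (bytes). The actual smonitor.npy
# layout may differ from this assumption.
def leak_rate(t, used_bytes, start=10):
    sel = slice(start, None)            # skip initialization samples, like START=10
    tw, uw = t[sel], used_bytes[sel]
    db = uw[-1] - uw[0]                 # byte increase across the selected window
    dt = tw[-1] - tw[0]                 # window duration in seconds
    return db/dt, (db/1e9)/dt           # leak rate in bytes/s and GB/s

# synthetic recording : 0.1 s sampling, linear ~14.2 MB/s leak on top of 1283 MiB
t = np.arange(0.0, 10.0, 0.1)
used = 1283*1024**2 + 14244616.140*t
bps, gbps = leak_rate(t, used)
per_launch = bps/1000.0                 # divide by approx 1000 launches, as above
print(bps, gbps, per_launch)
```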
EOU
}


SDIR=$(dirname $(realpath $BASH_SOURCE))
TEST=input_genstep $SDIR/cxs_min.sh