### Simplified steps to submit a production

First, clone the repository:

```bash
$ git clone git@github.com:BNLNPPS/sphenixprod.git
$ source sphenixprod/this_sphenixprod.sh
```

Then run a submission using the submission script `sphenixprod/create_submission.py`.

For help, use `create_submission.py -h`. For testing purposes with verbose output, use `create_submission.py -n`, adding `-v`, `-vv`, or `-vvv` for increasing verbosity. Using `-n` will explicitly not submit anything.

An example submission requires:

1. a configuration yaml file
2. a rule in that yaml file to run
3. a run range

```bash
create_submission.py \
--config ProdFlow/short/run3auau/v001_combining_run3_new_nocdbtag.yaml \
--rule DST_TRIGGERED_EVENT_run3physics \
--runs 69600 72000 \
-vv -n
```

#### Processing Large Run Lists in Chunks

For large run lists, you can use the `--chunk-size` parameter to process runs in smaller batches. This provides faster feedback and allows submission to start sooner:

```bash
# Process runs in chunks of 100
create_submission.py \
--config ProdFlow/short/run3auau/v001_combining_run3_new_nocdbtag.yaml \
--rule DST_TRIGGERED_EVENT_run3physics \
--runs 69600 72000 \
--chunk-size 100 \
-vv -n
```

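Conceptually, the splitting can be sketched as follows. This is an illustration of what `--chunk-size` does with an inclusive run range, not the script's actual implementation:

```shell
# Sketch only: split the inclusive run range 69600-72000 into
# chunks of at most 100 runs, as --chunk-size 100 would.
first=69600; last=72000; size=100
nchunks=0
start=$first
while [ "$start" -le "$last" ]; do
  end=$((start + size - 1))
  if [ "$end" -gt "$last" ]; then end=$last; fi
  nchunks=$((nchunks + 1))
  start=$((end + 1))
done
echo "split $first-$last into $nchunks chunks"
```

For this range (2401 runs) that yields 24 full chunks plus one final single-run chunk.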
With `--andgo`, jobs from each chunk will be submitted as soon as that chunk completes processing:

```bash
# Process in chunks and submit each chunk automatically
create_submission.py \
--config config.yaml \
--rule RULE_NAME \
--runs 69600 72000 \
--chunk-size 100 \
--andgo \
-vv
```

To run a production job for real, you first need to clean out the file catalog (FC) for your particular jobs. Jobs will not run if there are existing entries in the FC, so that actively running jobs are not inadvertently overwritten:

```bash
psql -d Production -h sphnxproddbmaster.sdcc.bnl.gov -U argouser -c "delete from production_status where run>=66517 and run<=66685 and dsttype like 'DST_TRIGGERED_EVENT%' ;"
```

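Before running a `delete`, it can be reassuring to count the rows it would touch. A sketch using the same connection parameters, with the `where` clause adjusted to your run range and dsttype:

```bash
psql -d Production -h sphnxproddbmaster.sdcc.bnl.gov -U argouser \
  -c "select count(*) from production_status where run>=66517 and run<=66685 and dsttype like 'DST_TRIGGERED_EVENT%' ;"
```

If the count is unexpectedly large or small, double-check the run range and the `dsttype` pattern before deleting.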
Before submitting, check / change / ask Kolja to change the autopilot schedule to make sure your freshly freed-up job opportunities don't get picked up by the production dispatcher.

Then you can submit in one swoop with:

```bash
create_submission.py \
--config ProdFlow/short/run3auau/v001_combining_run3_new_nocdbtag.yaml \
--rule DST_TRIGGERED_EVENT_run3physics \
--runs 69600 72000 \
-vv --andgo
```

The condor ID will be printed out so that you can monitor the status of the job. Keep in mind that at most 50k jobs can be running on a particular node.

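To check on the jobs, you can query condor either by the printed cluster ID or by batch name; both values below are placeholders to be filled in with your actual ID and batch name:

```bash
# By cluster ID (use the ID printed at submission)
condor_q 12345678
# Or summarize all jobs of one batch
condor_q -const 'JobBatchName=="<your batch name>"' -totals
```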
### Steps for Complete Reset of a Production

First step: stop the autopilot submission. For example, to stop the autopilot for CLUSTER and downstream SEED jobs for the run3pp full production, we would modify `ProdFlow/short/run3pp/run3pp_production.yaml` as follows, making sure to disable all instances of the job types we want to reset, across all production hosts. The key is to turn off the `submit`, `dstspider`, and `finishmon` flags:

```yaml
# TRKR_CLUSTER physics
Full_TRKR_CLUSTER_run3pp:
  config: ppstreaming_from523_physics_ana526_2025p008_v001.yaml
  runs: [79300, 90000]
  jobprio: 60
  submit: off
  dstspider: off
  finishmon: off

# TRKR_SEED physics
Full_TRKR_SEED_run3pp:
  config: ppstreaming_from523_physics_ana526_2025p008_v001.yaml
  runs: [79300, 90000]
  jobprio: 30
  submit: off
  dstspider: off
  finishmon: off
```

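A quick sanity check that no relevant stanza was missed: grep the steer file (path as in the example above) for the rule names and the three flags, and eyeball that each flag is `off`:

```bash
grep -nE 'Full_TRKR_(CLUSTER|SEED)_run3pp:|submit:|dstspider:|finishmon:' \
  ProdFlow/short/run3pp/run3pp_production.yaml
```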
Next step: kill all queued jobs. If you start typing `condor_rm` and then press the up arrow a few times, you should get to a command containing `... -const 'JobBatchName...'` that you can reuse (modifying the batch name if the one in your history isn't right). If you can't find it, the full command should look something like the following; just make sure the batch name matches the specific job types you want to remove:

```bash
condor_rm -const 'JobBatchName=="main.Full_TRKR_CLUSTER_run3pp_run3pp_ana526_2025p008_v001"'
```

Next step: double-check that no cron-generated `create_submission` processes were still ongoing when you changed the steering file. That can be done with:

```bash
ps axuww | grep create
```

Cleanup step: We can do one of two things:

1. Delete all the output directories on lustre (and the logs on gpfs while we're at it).
2. Leave the existing output on disk; it will be overwritten. The auto-spider is off, but there are probably `.root:nevents:...` files still lying around that would confuse the spider when it's restarted. In that case, run a final dstspider command on both locations (with the run range we're interested in).

Option 1 is the cleaner way to do it, but it can involve deleting millions of files and take a painfully long time. The best way is to use a combination of lustre tools, namely their own (less powerful) version of `find` and a specific bare-bones version of `rm`, namely `munlink`. Building the command in pieces:

Example directories to search:
```bash
/sphenix/lustre01/sphnxpro/production/run3pp/physics/ana526_2025p008_v001/DST_TRKR_*/run_00079*
```
You can use all kinds of constraints, like `-name DST\*`. We just want all regular files (directories give error messages), so the way to find it all is (`lfs find` uses the lustre tool instead of the linux tool):
```bash
lfs find [directories] -type f
```
Finally, there's some dark magic you just need to note down or memorize: print one result per line with the right separator, select that separator in `xargs`, then use `munlink`. The final complete line is (those are zeros in `-print0` and `xargs -0`):
```bash
lfs find /sphenix/lustre01/sphnxpro/production/run3pp/physics/ana526_2025p008_v001/DST_TRKR_*/run_00079* -type f -print0 | xargs -0 munlink
```
You can paste the line without the `| xargs ...` part first to see that it returns the files you'd expect.

At the same time, do the same to the gpfs files. Frustratingly, `munlink` becomes `unlink`, the `-print0 ... xargs -0` incantation differs, etc. But the regular `find` is more powerful and supports `-delete` directly. The full line to use is:
```bash
find /sphenix/data/data02/sphnxpro/production/run3pp/physics/ana526_2025p008_v001/DST_TRKR_*/run_00079* -type f -delete
```

Next step: clean out the Production db and the FileCatalog. The Production command will look something like the following, adjusted for the specific file types you want to remove:

```bash
psql -d Production -h sphnxproddbmaster.sdcc.bnl.gov -U argouser -c "delete from production_status where dstname like 'DST_TRKR_CLUSTER%run3pp%ana526%' and run between 79200 and 80000;"
```

The FileCatalog can be opened with `psql -d FileCatalog -h sphnxdbmaster.sdcc.bnl.gov`, and you will want to run something like the following, adjusted for the conditions you want to remove:

```sql
delete from files
USING datasets
WHERE
files.lfn=datasets.filename
and
datasets.dsttype='DST_TRKR_CLUSTER'
and
datasets.runnumber between 79200 and 80000
and
datasets.tag='ana526_2025p008_v001'
;
```
If you need to find out which other conditions you can cut on in your query, try running `select * from datasets` (or from `files`); the printout shows the available columns.

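A lighter-weight way to see the available columns, without pulling any rows, is psql's `\d` meta-command from within the FileCatalog session:

```
\d datasets
\d files
```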
You also need to delete the matching rows from `datasets` as well, using the same cuts as in the query above. Since those cuts are all on `datasets` columns, the `USING files` join and the `files.lfn` condition can simply be dropped.

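A sketch of the corresponding `datasets` deletion, keeping the same cuts as the `files` query above but without the join:

```sql
delete from datasets
WHERE
dsttype='DST_TRKR_CLUSTER'
and
runnumber between 79200 and 80000
and
tag='ana526_2025p008_v001'
;
```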
Lastly, check each of the 4 hosts to make sure there isn't anything that has been overlooked, then turn the automatic production flags back on!
