# ePIC Production Operations

This document covers operating official ePIC production with BNL PanDA on
`pandaserver02.sdcc.bnl.gov`: submitting tasks, monitoring them, and retrieving
their logs. It is the
operations counterpart to the design docs: [PCS.md](PCS.md) (configuration),
[JEDI_INTEGRATION.md](JEDI_INTEGRATION.md) (PCS→JEDI submission design),
[EPICPROD_TASK_CATALOG.md](EPICPROD_TASK_CATALOG.md) (the task catalog), and
[PRODUCTION_DEPLOYMENT.md](PRODUCTION_DEPLOYMENT.md) (deploying swf-monitor).

The `prun` submission path described here is the foundation the automated
PCS submission builds on: it establishes that an operator's identity can submit
official `group.EIC` production tasks. The first official `EIC.production`
submission through this path was validated 2026-06-01 (jediTaskID 36439).

## Identity and client

**EIC/production IAM subgroup.** Official production submission requires
membership in the `EIC/production` subgroup in PanDA IAM
(`panda-iam-doma.cern.ch`). Production managers request to be added. Membership
is carried in the OIDC id_token `groups` claim, which must contain
`EIC/production`. The IAM consent screen names the OAuth client ("panda robot"),
not the subgroup; the subgroup scoping comes from `PANDA_AUTH_VO` plus the
account's group membership, confirmed in the token, not on the consent screen.
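
Because membership is confirmed in the token rather than on the consent screen, it can help to look at the `groups` claim directly. The following is a minimal sketch, assuming you already have the raw id_token string in hand; `jwt_claims` and `has_group` are illustrative names, not part of panda-client, and the payload is decoded for inspection only, without verifying the signature.

```python
import base64
import json


def jwt_claims(id_token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    parts = id_token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a header.payload.signature token")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))


def has_group(id_token: str, group: str = "EIC/production") -> bool:
    """True if the token's `groups` claim lists the given subgroup."""
    return group in jwt_claims(id_token).get("groups", [])
```

How the cached token file under `$PANDA_CONFIG_ROOT` is laid out is not covered here, so the sketch takes the token string as input rather than reading that file.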

**panda-client (`pclient`).** Lives at `~/pclient` on `pandaserver02`, as a plain
Python venv. Set up the environment with:

```bash
source ~/pclient/run/setup.sh
```

This sets `PANDA_AUTH=oidc`, default `PANDA_AUTH_VO=EIC`,
`PANDA_CONFIG_ROOT=~/.pathena`, and the server URLs (`pandaserver01:25443`).
`run/setup.sh` is bespoke (the SDCC URLs); `etc/panda/` is generated by the
client install.

**Upgrading the client.** Run `~/pclient/bin/pip install -U panda-client`. The
upgrade reaches PyPI only with the combined CA bundle in place (see *TLS / CA*).
The previous install is preserved at `~/pclient-2025`.

## Submitting an official production task

The prod-manager recipe, run from a clean working directory containing only the
payload script:

```bash
cd <workdir>                          # sandbox = this dir; keep it minimal
source ~/pclient/run/setup.sh
export PANDA_AUTH_VO=EIC.production
rm -rf "$PANDA_CONFIG_ROOT/.token"    # force a fresh subgroup-scoped token
prun --exec "./my_script.sh" --official \
     --outDS group.EIC.$(uuidgen) \
     --nJobs 1 --expertOnly_skipScout \
     --vo wlcg --site BNL_PanDA_1 \
     --prodSourceLabel test \
     --workingGroup EIC.production \
     --noBuild --outputs myout.txt
```

`--official` with the `group.EIC` output scope is what exercises the
`EIC.production` privilege; an authorization rejection means the identity is not
in the subgroup. `--prodSourceLabel test` makes this a test-label task carrying
the production identity, which is the right shape for a capability check.
`--expertOnly_skipScout` is valid but hidden from `prun --help`.

**Authentication is interactive and must run in a real shell.** Deleting the
token forces a fresh OIDC device flow: `prun` prints a verification URL, then
prompts `Ready to get ID token? [y/n]`. It cannot be backgrounded or driven by a
non-interactive process (it reads the prompt from stdin and will hit EOF).
Open the URL, authenticate, consent as `EIC.production`, answer `y`. The token
is then cached under `$PANDA_CONFIG_ROOT` and reused by subsequent commands.

**Result.** `prun` prints `jediTaskID=<N>`. Note that PanDA records the task's
`workinggroup` as `EIC` even though `--workingGroup EIC.production` was passed:
the production dimension is the IAM role, not the working-group field.
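
If the `prun` output is captured to a file for record keeping, the task ID can be pulled out of it afterwards. This is a small sketch that relies only on the `jediTaskID=<N>` token mentioned above; the helper name `parse_jedi_task_id` is illustrative, and nothing is assumed about the rest of the output.

```python
import re
from typing import Optional

# Matches the `jediTaskID=<N>` token that prun prints on success.
_TASK_ID = re.compile(r"jediTaskID=(\d+)")


def parse_jedi_task_id(prun_output: str) -> Optional[int]:
    """Return the first jediTaskID found in captured prun output, else None."""
    match = _TASK_ID.search(prun_output)
    return int(match.group(1)) if match else None
```

Returning `None` rather than raising leaves it to the caller to decide whether a missing ID means a rejected submission or simply truncated output.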

## Re-submitting after a broken submission

Once a PCS task records a `jediTaskID`, the submit path refuses a second
submission (`prodtask_submit_request` raises while `panda_task_id` is set), and
the task page shows the PanDA link in place of the Submit control. A submission
that broke or aborted PanDA-side therefore leaves the task pinned to a dead task
ID with no way forward. The **Reset submission** button (owner-only, shown beside
the PanDA link on both the task page and the compose panel) clears that:
`panda_task_id → None`, `status → draft`. The task returns to the buildable
lifecycle and Submit goes live again. Reset only detaches the reference: it does
not stop or delete the PanDA task, since the web tier holds no PanDA credential;
abort the dead task in PanDA separately if needed.
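
The guard and the reset can be pictured as a two-state transition. The sketch below is illustrative only: `ProdTask`, `submit`, `reset_submission`, and the `"submitted"` status value are hypothetical stand-ins, not the PCS model or the real `prodtask_submit_request`.

```python
from dataclasses import dataclass
from typing import Optional


class AlreadySubmitted(Exception):
    """Raised when a task that already holds a PanDA task ID is submitted again."""


@dataclass
class ProdTask:
    # Hypothetical stand-in for the PCS task record.
    status: str = "draft"
    panda_task_id: Optional[int] = None


def submit(task: ProdTask, new_task_id: int) -> None:
    """Record a submission; refuse while a task ID is already attached."""
    if task.panda_task_id is not None:
        raise AlreadySubmitted(f"already pinned to jediTaskID {task.panda_task_id}")
    task.panda_task_id = new_task_id
    task.status = "submitted"  # assumed status name


def reset_submission(task: ProdTask) -> None:
    """Detach the PanDA reference only; the PanDA-side task is left untouched."""
    task.panda_task_id = None
    task.status = "draft"
```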

This is a commissioning-era recovery affordance. Gate or remove it once
submissions are reliable, so a submitted task is not casually detached.

## TLS / CA: pip and Rucio from this host

BNL internal services (`*.sdcc.bnl.gov`: PanDA, swf-monitor, Rucio) use a
private CA captured in `swf-monitor/full-chain.pem`. `~/.env` points
`REQUESTS_CA_BUNDLE` there so `requests`-based tools trust those servers. That
bundle lacks public roots, so on its own it makes `pip`/`requests` reject
PyPI's valid public certificate (`CERTIFICATE_VERIFY_FAILED`), and it overrides
even an explicit `--cert`.

The fix is a combined bundle (system public roots plus the BNL chain) at
`/data/wenauseic/certs/ca-bundle-combined.pem`, with `REQUESTS_CA_BUNDLE` and
`SSL_CERT_FILE` pointing at it (set in `~/.env`, auto-rebuilt when either source
changes). Both PyPI and `*.sdcc.bnl.gov` then validate against one trust store.
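
The rebuild-on-change behaviour can be sketched as a concatenation guarded by a staleness check. This is an illustration under assumptions, not the mechanism actually wired into `~/.env`: staleness is judged by file modification time, and `rebuild_if_stale` is a hypothetical helper name. The source paths are parameters because the location of the system public roots is not stated here.

```python
from pathlib import Path


def rebuild_if_stale(sources, combined) -> bool:
    """Concatenate PEM sources into `combined` if any source is newer than it.

    Returns True when the bundle was (re)written, False when it was current.
    """
    sources = [Path(s) for s in sources]
    combined = Path(combined)
    newest = max(s.stat().st_mtime for s in sources)
    if combined.exists() and combined.stat().st_mtime >= newest:
        return False
    # Make sure each source ends with exactly one newline so PEM blocks do not fuse.
    parts = [s.read_text().rstrip("\n") + "\n" for s in sources]
    tmp = combined.with_name(combined.name + ".tmp")
    tmp.write_text("".join(parts))
    tmp.replace(combined)  # rename over the old bundle so readers never see a partial file
    return True
```

Writing to a temporary file and renaming keeps a concurrent `pip` or `requests` call from reading a half-written trust store.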

## Monitoring

ePIC-tailored task and job views in swf-monitor:

| View | Path |
|---|---|
| Task | `panda/tasks/<jeditaskid>/` |
| Job | `panda/jobs/<pandaid>/` |
| System status | `panda/system/` |

- Internal (BNL CILogon): `https://pandaserver02.sdcc.bnl.gov/swf-monitor/<path>`
- External (swf-remote proxy, django login): `https://epic-devcloud.org/prod/<path>`
- Generic BigPanDA: `https://pandamon01.sdcc.bnl.gov/task/<jeditaskid>/`

System status is documented in [SYSTEM_STATUS.md](SYSTEM_STATUS.md). The page
and its JSON endpoint read cached DB rows refreshed by `epicprod_ops_agent`;
they do not probe services from Apache requests. The production nav `System`
item turns red when the cached aggregate is red or stale, so both
`pandaserver02` and devcloud surface infrastructure trouble quickly.

## Logs

Two distinct artifacts:

**Pilot (framework) log**: the HTCondor job stdout/stderr/batch from the
harvester, served on the job page as `log_urls`. `study_job`
(`src/monitor_app/panda/queries.py`) derives these by joining the job to its
harvester worker (`harvester_workers.stdout/stderr/batchlog`), falling back to
splitting the `pilotid` field. The pilot dumps the payload stdout *inline* near
the end of its stdout, so for small payloads the script output is visible there.

**Payload log (complete)**: the job's Rucio log tarball,
`<taskname>.log.<jeditaskid>.<seq>.log.tgz` in the `.log` dataset, replicated on
`BNL_PROD_DISK_1`. It contains `payload.stdout`, `payload.stderr`,
`pilotlog.txt`, `pandatracerlog.txt`, `PoolFileCatalog.xml`, and the pilot
heartbeat/upload JSON.

Retrieving the tarball by hand:

```bash
# 1. resolve a replica (Rucio REST, x509, account panda, VO eic; the
#    rucio-eic-mcp-server pattern). The advertised replica is xrootd-only:
#    root://dcintdoor.sdcc.bnl.gov:1094/pnfs/sdcc.bnl.gov/eic/epic/disk/group/EIC/...
# 2. fetch with the long-lived proxy and extract:
export X509_USER_PROXY=/data/wenauseic/longproxy-for-rucio
xrdcp -f "root://dcintdoor.sdcc.bnl.gov:1094/<pfn>" "$SWF_TMP_DIR/downloads/<lfn>"
tar -xzf "$SWF_TMP_DIR/downloads/<lfn>" -C <dest>
```

The Rucio metadata path (resolve/list) is what the bots and the rucio MCP do
routinely; fetching the *bytes* over xrootd is the added step, and it needs the
proxy. `BNL_PROD_DISK_1` advertises only a `root://` door, so `xrdcp` (or
`xrdfs`) is required; a plain HTTPS GET will not work.
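
When the extraction step is scripted instead of run through `tar`, it is worth guarding against members that would land outside the destination. The following is a minimal sketch using the standard-library `tarfile` module; `extract_log_tarball` is an illustrative name, and the internal directory layout of the log tarball is not assumed, so relative member paths are preserved as found.

```python
import tarfile
from pathlib import Path


def extract_log_tarball(tarball, dest) -> list[str]:
    """Extract regular files from a .tgz into dest, skipping unsafe members.

    Returns the member names that were written.
    """
    dest = Path(dest).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(tarball, "r:gz") as tar:
        for member in tar.getmembers():
            target = (dest / member.name).resolve()
            # Skip links, devices, directories, and anything escaping dest.
            if not member.isfile() or not target.is_relative_to(dest):
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            with tar.extractfile(member) as src:
                target.write_bytes(src.read())
            extracted.append(member.name)
    return extracted
```

Reading each member fully into memory is fine for log-sized files; a streaming copy would be the natural change if tarballs grow large.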

**Fallback: the BigPanDA filebrowser.** If the direct Rucio path is ever
unavailable, the generic BigPanDA monitor exposes the same payload logs through
its filebrowser, which does the Rucio download server-side and serves the files
over HTTP without OIDC:

```bash
curl -sk -H 'Accept: application/json' \
     "https://pandamon01.sdcc.bnl.gov/filebrowser/?pandaid=<pandaid>"
```

This returns a JSON file list with media links to download (the approach in Xin
Zhao's `get_job_logs.sh`). Treat it as Plan B only: it couples to the generic
monitor and its datastore, the server-side fetch is synchronous (it takes
seconds), and it does not generalize to the other credentialed operations the
ops agent exists for, which is why the direct Rucio path above is preferred.

## Scratch / cache area

`/data/swf-tmp` (`$SWF_TMP_DIR`) is the managed, evictable scratch root on the
large `/data` volume; `/tmp` is tiny and on the squeezed root volume, so it is
not used for testbed scratch. Layout:

```
/data/swf-tmp/
  panda-logs/<jeditaskid>/<pandaid>/   # extracted job log members
  downloads/                           # transient tarballs
```

Owned `wenauseic:eic`, setgid, world-readable so a web view can serve cached
output without holding any credential. Contents are reclaimable (logs are
immutable and re-fetchable from Rucio) and pruned periodically.
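
Periodic pruning of the `panda-logs/` tree can be sketched as an age sweep over the per-job directories. This is an illustration under assumptions: an entry is taken to be one `<jeditaskid>/<pandaid>/` directory, its age is taken from the directory's modification time, and `prune_panda_logs` is a hypothetical name. The 30-day default mirrors the `--prune-days 30` setting used by the cron job in this document.

```python
import shutil
import time
from pathlib import Path


def prune_panda_logs(root, max_age_days=30, now=None):
    """Remove <jeditaskid>/<pandaid>/ dirs older than max_age_days.

    Returns the list of removed job directories.
    """
    root = Path(root)
    cutoff = (time.time() if now is None else now) - max_age_days * 86400
    removed = []
    for job_dir in sorted(root.glob("*/*")):
        if job_dir.is_dir() and job_dir.stat().st_mtime < cutoff:
            shutil.rmtree(job_dir)
            removed.append(job_dir)
    # Drop task directories left empty by the sweep.
    for task_dir in list(root.iterdir()):
        if task_dir.is_dir() and not any(task_dir.iterdir()):
            task_dir.rmdir()
    return removed
```

Because the cache is reclaimable, an over-eager prune costs only a re-fetch on the next request.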

## Payload-log retrieval

A complete, clean payload log is one click from the job page, served from the
`$SWF_TMP_DIR/panda-logs/<jeditaskid>/<pandaid>/` cache. Three pieces:

- **Doer**: `scripts/cache-payload-log.py` resolves the log DID's replica
  (Rucio REST, x509, account `panda`), `xrdcp`s the tarball (bounded by
  `XRDCP_TIMEOUT`), extracts the members into the cache, and writes a `.done`
  sentinel last. Hit and skip are keyed on `.done`, never on a single member,
  because a log may legitimately lack `payload.stdout`. Usable standalone, by
  cron, or by the agent.
- **Agent**: `agents/epicprod_ops_agent.py` is an always-on `BaseAgent`
  (`agent_type=PRODOPS`) on the anycast queue `/queue/epicprod.ops`. It runs as
  `wenauseic`, so it alone holds the Rucio proxy and runs xrootd, under a fixed
  `prodops` namespace (from `agents/prodops.toml`). It is a system singleton,
  identifiable as `prodops` in the monitor, that every caller addresses
  explicitly (`namespace: prodops` in the message; foreign-namespace messages are
  filtered out). It dispatches by `msg_type`; `fetch_payload_log` invokes the doer
  under a timeout and, on failure or timeout, records an `.error` marker (attempt
  count + reason) in the cache dir; `health_ping` replies `pong`; `shutdown` is
  the deliberate-stop back door (below). A new capability is a new handler.
- **View**: `panda_payload_log` (`panda/jobs/<pandaid>/payload-log/`, linked
  from a "Payload Log" card on the job page). A hit (`.done`) serves the
  extracted log as text; a miss publishes `fetch_payload_log` to the agent via
  the existing `ActiveMQConnectionManager` and returns `202`. A prior failure is
  surfaced with its reason; after `EPICPROD_MAX_FETCH_ATTEMPTS` (default 3) the
  view stops re-triggering and reports the error (operator override: `?force=1`).
  The web tier never touches the proxy or xrootd: it reads the world-readable
  cache and drops a message on the bus.
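
The view's hit, miss, and give-up behaviour described above reduces to a small decision over the cache directory. The sketch below is illustrative rather than the real `panda_payload_log` code: `Action`, `read_attempts`, and `decide` are hypothetical names, and the `.error` marker is assumed to be JSON with an `attempts` field, since its on-disk format is not specified here.

```python
import json
from enum import Enum
from pathlib import Path


class Action(Enum):
    SERVE = "serve"                # cache hit: serve the extracted log
    FETCH = "fetch"                # publish fetch_payload_log and answer 202
    REPORT_ERROR = "report_error"  # attempts exhausted: show the stored reason


def read_attempts(cache_dir: Path) -> int:
    """Attempt count from the .error marker (assumed JSON), 0 if absent or unreadable."""
    marker = cache_dir / ".error"
    if not marker.exists():
        return 0
    try:
        return int(json.loads(marker.read_text()).get("attempts", 0))
    except (ValueError, TypeError, AttributeError):
        return 0


def decide(cache_dir, max_attempts: int = 3, force: bool = False) -> Action:
    """Key the hit on the .done sentinel, never on an individual log member."""
    cache_dir = Path(cache_dir)
    if (cache_dir / ".done").exists():
        return Action.SERVE
    if read_attempts(cache_dir) >= max_attempts and not force:
        return Action.REPORT_ERROR
    return Action.FETCH
```

Checking `.done` first means a later successful fetch wins over a stale `.error` marker without any cleanup step.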

Flow on a miss: job view → publish → agent (real time) → resolve + `xrdcp` +
extract → cache → refresh serves it. No polling.

### Running it

The agent is a systemd service like the `swf-*-bot` units. The reference unit,
`epicprod-ops-agent.service` in the repo root, sets `User=wenauseic`,
`Restart=always`, `RestartSec=15`, and a burst cap (`StartLimitBurst=5` per
`StartLimitIntervalSec=120`), and is `enable`d for boot. A persistent agent that
keeps exiting is sick whatever its exit code, so the burst cap lets it land in
`failed` (visible) instead of flapping forever. It runs from the deploy tree
(`/opt/swf-monitor/current`) off `production.env`, which supplies everything it
needs: `REQUESTS_CA_BUNDLE` (the combined BNL+public bundle, for the doer's Rucio
REST), `X509_USER_PROXY`, and the `ACTIVEMQ_*` / `SWF_*` vars. `SWF_MONITOR_URL`
must carry the `/swf-monitor` app path: the cleaner-killer reads the agent
registry at `{SWF_MONITOR_URL}/api/systemagents/`, and under cron `production.env`
is the only environment (no `~/.env` overlay), so a bare host would reach the
server root behind CILogon, not the API. The proxy
(`/etc/swf-monitor/longproxy-for-rucio`) is `apache:eic`, group-readable, so the
web tier (owner) and the agent (group `eic`) share one copy; the directory is
setgid with a default ACL so a re-created proxy stays `eic`-readable.
`SWF_TMP_DIR` is left to its default (`/data/swf-tmp`), which the agent and the
web view share. Bring it back manually with `sudo systemctl restart
epicprod-ops-agent`. Being a `BaseAgent`, it registers and heartbeats to the
monitor and logs to the monitor DB, so it appears in the agent list.
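
The `SWF_MONITOR_URL` requirement is easy to get wrong, so a startup check along these lines may be useful. This is a hedged sketch, not code from the cleaner-killer: `agent_registry_url` is a hypothetical helper that builds the registry URL quoted above and refuses a URL with no app path.

```python
from urllib.parse import urlsplit


def agent_registry_url(swf_monitor_url: str) -> str:
    """Build {SWF_MONITOR_URL}/api/systemagents/, rejecting a bare-host URL."""
    if not urlsplit(swf_monitor_url).path.strip("/"):
        raise ValueError(
            "SWF_MONITOR_URL has no app path (expected e.g. .../swf-monitor)"
        )
    return swf_monitor_url.rstrip("/") + "/api/systemagents/"
```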

**Deploys and restarts**: the deploy script does *not* restart this unit (it
reloads Apache and the ASGI/bot workers only). What the deploy changed decides
whether that matters. The agent dispatches its doer scripts (`submit-evgen-task.py`
→ `evgen_panda_submit.py`, `cache-payload-log.py`, `rucio-snapshot-update.py`,
`pcs-catalog-import.py`, …) as fresh subprocesses by absolute path into the deploy
tree, and a branch deploy replaces that constant `branch-<branch>` release path in
place, so a change to a **doer script** is live on the next dispatch with **no
restart**. A change to the **agent module itself** (`agents/epicprod_ops_agent.py`:
handlers, routing, the script-path constants, timeouts) or to its startup inputs
(`prodops.toml`, `production.env`) is read into memory at startup and is **not**
picked up until `sudo systemctl restart epicprod-ops-agent`. The running process
also pins its cwd to the release inode it started from (deploys delete+recreate that
dir, so `cwd` reads `(deleted)`); absolute-path dispatch is unaffected, but a
periodic clean restart avoids any relative-path surprise.

**Deliberate stop**: two back doors, neither counted as a failure: `sudo
systemctl stop epicprod-ops-agent` (host-level; SIGTERM unwinds BaseAgent's
graceful path, and systemd does not restart an admin stop), and a `shutdown`
message on the bus (`namespace: prodops`), which exits the agent with sentinel
code 100; `SuccessExitStatus=100` / `RestartPreventExitStatus=100` tell systemd
to leave it stopped. Any other exit is a crash and is restarted (burst-capped).
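
The namespace filter, the `msg_type` dispatch, and the sentinel exit fit together roughly as follows. This is a simplified sketch, not the real `BaseAgent` subclass: the handler names, the shape of the `pong` reply, and the plain-dict message format are assumptions made for illustration.

```python
import sys

NAMESPACE = "prodops"
SHUTDOWN_EXIT_CODE = 100  # the sentinel systemd is told not to restart on


def handle_health_ping(msg: dict) -> dict:
    """Answer a liveness probe; the reply shape is an assumption."""
    return {"msg_type": "pong", "namespace": NAMESPACE}


def handle_shutdown(msg: dict) -> None:
    """Deliberate stop: exit with the sentinel code so the unit stays down."""
    sys.exit(SHUTDOWN_EXIT_CODE)


# A new capability is a new entry here.
HANDLERS = {
    "health_ping": handle_health_ping,
    "shutdown": handle_shutdown,
}


def dispatch(msg: dict):
    """Route one bus message; foreign-namespace and unknown types are ignored."""
    if msg.get("namespace") != NAMESPACE:
        return None
    handler = HANDLERS.get(msg.get("msg_type"))
    return handler(msg) if handler else None
```

Any exit that does not go through `handle_shutdown` carries a different code, which is what lets systemd tell a crash from a requested stop.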

A single cron job, the **cleaner-killer** (`scripts/prodops-cleaner-killer.py`),
keeps the singleton honest. It is run via the `.sh` wrapper that loads
`production.env`, parsed as literal `KEY=VALUE` rather than bash-`source`d,
because a Django/decouple env file carries unquoted `$ & ( )` in values such as
`SECRET_KEY`. Its steps run in this order, and the order matters:
- **Reap** duplicates: keep only the systemd-managed instance, with a host-gated
  `SIGKILL` of any other live PRODOPS agent, identified by its registry-saved
  pid (the source `swf_kill_agent` uses), never a process-name match. On an
  anycast queue a second subscriber would *steal* requests; this is the
  autonomous, system-singleton counterpart to the interactive MCP `swf_kill_agent`.
- **Liveness** (~2 min): with duplicates gone, an MQ `health_ping`→`pong` round
  trip (carrying `namespace: prodops`); no answer triggers `reset-failed` +
  `restart`, where the `reset-failed` is needed to revive a unit that hit the
  burst cap. The check is over the bus deliberately: for a messaging service the
  message path *is* the health. A unit that exited deliberately (sentinel 100) is
  left stopped, not fought; a failed restart is an alarm matter.
- **Prune** (daily, `--prune-days 30`): remove `panda-logs` entries older than
  30 days; the cache is reclaimable, a miss just re-fetches.
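
The literal `KEY=VALUE` parsing mentioned above is what keeps shell metacharacters in values intact. The actual wrapper is a shell script; the following Python sketch only illustrates the idea, and `parse_env_file` plus its handling of comments and blank lines are assumptions rather than a description of that wrapper.

```python
def parse_env_file(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines literally: no expansion, no quoting rules."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env
```

Because the value is never handed to a shell, characters such as `$`, `&`, `(`, and `)` arrive unchanged.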

Install (root):

```
*/2 * * * * /opt/swf-monitor/current/scripts/prodops-cleaner-killer.sh
30 3 * * * /opt/swf-monitor/current/scripts/prodops-cleaner-killer.sh --no-liveness --prune-days 30
```

The liveness restart uses `sudo systemctl restart`, so the cron account needs
passwordless sudo for that unit.

**Status:** deployed and live on `pandaserver02` (2026-06-01). The agent runs
under the `prodops` namespace; the systemd unit (burst-capped, deliberate-stop
sentinel) is installed and `enable`d; the cleaner-killer crons (reap + liveness
`*/2`, prune daily) are in root's crontab. The doer/view round trip is verified
against jediTaskID 36439, and reap + `health_ping`→`pong` are verified from the
deployed path. Remaining: the activity health chip, and the matching
`swf-remote` proxy entry for external access; see
[External Access](EXTERNAL_ACCESS.md).

### Future

Auto-notify the operator the moment the agent finishes, removing the manual
refresh.
0311 refresh.