swf-monitor/docs/EPICPROD_OPS_AGENT.md

0001 # ePIC Production Operations Agent
0002
0003 `epicprod_ops_agent` is the always-on, credentialed executor for ePIC
0004 production. It performs the privileged production actions — submit to PanDA,
0005 stage logs from Rucio over xrootd, and the credentialed work to come — that the
0006 public web tier structurally cannot. Building out epicprod functionality is, in
0007 large part, adding capabilities to this agent: it is the intended instrument for
0008 programmatic work against the privileged production services.
0009
0010 This is a design/planning doc, peer to [PCS.md](PCS.md),
0011 [JEDI_INTEGRATION.md](JEDI_INTEGRATION.md),
0012 [EPICPROD_TASK_CATALOG.md](EPICPROD_TASK_CATALOG.md), and
0013 [PCS_DATASET_REQUEST_WORKFLOW.md](PCS_DATASET_REQUEST_WORKFLOW.md). Its
0014 operations counterpart — how to run, restart, and monitor the agent, and the
0015 concrete payload-log retrieval mechanics — is [EPICPROD_OPS.md](EPICPROD_OPS.md).
0016 The agent is built on the testbed's `swf_common_lib.base_agent.BaseAgent`, so it
0017 inherits testbed agent management and monitor visibility like the other agents.
0018
0019 ## Role — surfaces vs. executor
0020
0021 The system separates *presentation* and *API surface* from *privileged
0022 execution*:
0023
0024 - **Web tier (Apache/Django)** presents pages and holds **no credential**. It
0025   reads world-readable caches and the database, and it drops messages on the
0026   bus. It never runs a privileged client.
0027 - **REST and MCP** are thin peer API surfaces over the credential-free PCS
0028   service layer (`pcs.services`). REST serves the web UI and scripts; MCP is the
0029   LLM-facing API for bots and assistants. Both turn wire-format input into a
0030   service call — they are *surfaces*, not executors. **MCP is an LLM API, not an
0031   execution engine**: no PanDA, Rucio, or xrootd credential is ever wired into
0032   the MCP server or the web tier.
0033 - **The agent** is the **credentialed executor**. It runs as the production-ops
0034   user (currently `wenauseic`), so it alone holds the keys — the Rucio x509
0035   proxy, the PanDA OIDC token, xrootd — and it is the single chokepoint through
0036   which every privileged action passes, whatever surface triggered it.
0037
0038 A trigger may arrive from the web tier, from REST, from an MCP tool driven by a
0039 bot or assistant, or from cron. All of them resolve to one message on the
0040 agent's queue, and the agent does the credentialed work. Human-driven decisions
0041 drive deterministic execution; the agent is where that execution lives.
0042
0043 Concretely, this is what unblocks server-side submission. `JEDI_INTEGRATION.md`
0044 records that submission from swf-monitor was blocked because the web service has
0045 no PanDA identity. The agent removes the block without waiting for a robot
0046 account: running as the operator, it reuses the operator's cached production
0047 token and submits non-interactively. An OIDC service account remains the
0048 long-term path (it would let the agent submit as a robot rather than as the
0049 operator), but it is no longer a prerequisite for automated submission.
0050
0051 ## Credential boundary
0052
0053 The agent runs under the production-ops user, not the web service, putting
0054 ownership with production and keeping every credential out of the public-facing
0055 tier:
0056
0057 | Credential | Used for | Held by |
0058 |---|---|---|
0059 | PanDA OIDC token (cached, `EIC.production`) | `prun` submission to JEDI | agent |
0060 | Rucio x509 proxy (`longproxy-for-rucio`) | replica resolution, DID queries | agent |
0061 | xrootd (`xrdcp`/`xrdfs`) | fetching log/output bytes | agent |
0062
0063 The web tier's only interaction with privileged results is to **read a
0064 world-readable cache** the agent populates, or to **read the database record**
0065 the agent updates. It holds nothing and runs nothing privileged. The
0066 payload-log flow is the reference instance: the job page drops a
0067 `fetch_payload_log` message and later serves the extracted log from the cache;
0068 it never touches the proxy or xrootd.
0069
0070 **Where the agent writes — `$SWF_TMP_DIR`, never the deploy tree.** The agent
0071 runs as the production-ops user (`wenauseic`, group `eic`); the web tier runs as
0072 `apache`. Any artifact the agent produces for the web tier to read must live in
0073 the agent-writable, world-readable shared cache `$SWF_TMP_DIR` (`/data/swf-tmp`)
0074 — the tree the payload-log cache and the Rucio snapshots both use, with paths
0075 derived from the `SWF_TMP_DIR` setting. It must **not** live under the deploy
0076 tree (`/opt/swf-monitor/shared/...`), which is owned by `apache`: a doer that
0077 writes there passes when the web tier writes the file and fails the instant the
0078 agent takes over the write (`[Errno 13] Permission denied`). Like the
0079 external-safe trigger, this is a boundary the internal/dev path hides until the
0080 agent — not the web tier — performs the write.
0081
0082 ## Capability model
0083
0084 The agent is **event-driven, not polled** — the same low-latency model the
0085 testbed agents use, which matters as much for prod entities as for testbed
0086 ones.
0087
0088 - **Queue and identity.** It subscribes to the anycast control queue
0089   `/queue/epicprod.ops`; a single consumer handles each request exactly once. It
0090   runs under a fixed `prodops` namespace (from `agents/prodops.toml`) so it is
0091   identifiable as the system singleton in the monitor and every caller addresses
0092   it explicitly (`namespace: prodops`). Foreign-namespace messages are filtered
0093   out.
0094 - **Dispatch.** Each action is a `msg_type` routed to a `_handle_<msg_type>`
0095   method. **Growing the agent is adding a handler** — a new capability is a new
0096   `_handle_*`, registered in `KNOWN_TYPES`.
0097 - **The doer pattern.** A handler is a thin event front end; the actual work is
0098   delegated to a standalone **doer script** (`scripts/cache-payload-log.py`,
0099   `scripts/submit-prod-task.py`) run as a subprocess under a timeout. Each doer
0100   is usable on its own, by cron, or by the agent — the agent does not embed the
0101   logic. **PCS stays the single source of truth**: the submit doer *fetches* the
0102   `prun` command from the PCS artifact endpoint rather than rebuilding it.
0103 - **Robustness doctrine.** Every handler is **bounded** (a subprocess timeout)
0104   and **self-erroring** (it records its own failure where the triggering surface
0105   will see it — e.g. the `.error` marker in the payload-log cache). Handler
0106   exceptions are caught and logged; one sick capability never crashes the
0107   singleton. Control messages (`health_ping`, `shutdown`) do not flip the
0108   agent's processing state; work messages do, so the monitor shows the agent
0109   busy.
0110 - **Outcome conventions, no polling.** A result lands where the trigger's
0111   surface already reads it: the cache (payload log) or the `ProdTask` record via
0112   `record-submission` (submit). Liveness replies on the bus (`health_ping` →
0113   `pong`). Nothing polls.
0114
0115 ### Async execution — a BaseAgent capability
0116
0117 `BaseAgent` delivers messages on a single STOMP receiver thread, sequentially
0118 (`ack='auto'`), so a handler that blocks stalls every later message — including
0119 `health_ping`. A healthy `submit_task` returns in seconds, but the credentialed
0120 work coming next is not all fast: the campaign-provenance sweep is a genuinely
0121 long Rucio scan, and any privileged call can hang (this is distributed
0122 computing). While the receiver thread is blocked, the cleaner-killer's liveness
0123 ping (every ~2 min) goes unanswered and the watchdog restarts the unit —
0124 killing the in-flight work it was meant to protect.
0125
0126 The fix is **threads, not asyncio**. The work is blocking subprocess/socket I/O
0127 (`prun`, `xrdcp`, Rucio REST) and the stack is thread-based (stomp.py,
0128 subprocess); an asyncio agent would buy nothing here and would force every agent
0129 off the shared base. So the capability lives in `BaseAgent` itself — a
0130 bounded worker pool, reusable by all agents — exposed as
0131 `run_in_background(fn, *args, dedup_key=…, label=…)`. A handler enqueues its
0132 doer and returns, freeing the receiver thread at once.
0133
0134 It is **opt-in**, which is what protects the other agents: one that never calls
0135 `run_in_background` behaves exactly as before. The wrapper drives reentrant
0136 PROCESSING state (PROCESSING while any background work is in flight), catches and
0137 logs every exception (no silent worker death), and skips a call whose
0138 `dedup_key` is already running (the duplicate-work race that concurrency
0139 introduces). A send lock makes worker-thread bus sends safe; shutdown drains the
0140 pool.
0141
0142 In this agent, `fetch_payload_log`, `submit_task`, and `rucio_snapshot_update`
0143 enqueue via `run_in_background`; `health_ping` and `shutdown` stay inline. Because
0144 `BaseAgent` lives in `swf-common-lib` and ships to every agent through the venv
0145 chain, the other agents continue to heartbeat unchanged. The shared API is
0146 documented in the `swf-common-lib` README.
0147
0148 ## Building a new capability — the pattern
0149
0150 The testbed is becoming a live, automated, responsive production system, and this
0151 agent is the standard instrument for it. A new credentialed, slow, or hang-prone
0152 operation is not a new service — it is the same recipe, and the trigger comes
0153 first because it is the step most often gotten wrong:
0154
0155 1. **Choose an external-safe trigger.** Most collaborators reach the agent
0156    through the swf-remote face (`epic-devcloud.org`), where the proxy carries no
0157    session or CSRF and cannot relay a redirect (3xx → 502). Only two trigger
0158    shapes survive that hop:
0159    - a **GET** page-view that drops the message as a side-effect and returns a
0160      body (200/202), never a redirect — `fetch_payload_log` via the job page; or
0161    - a **POST to `/pcs/api/`**, authenticated by `X-Remote-User`
0162      (`TunnelAuthentication`, csrf-exempt) and returning **JSON**, never a
0163      redirect — `submit_task` via the REST `submit` action.
0164
0165    A page-view POST that relies on session+CSRF or ends in `redirect()` passes on
0166    the internal face and **fails through the proxy** — do not use it. See
0167    [EXTERNAL_ACCESS.md](EXTERNAL_ACCESS.md) → *Write actions and triggers*, and
0168    verify on `epic-devcloud.org`, not the internal face, which hides the
0169    constraint.
0170 2. **Handler + doer.** Add `_handle_<msg_type>` (validate, then enqueue) and a
0171    standalone `_do_<msg_type>` / `scripts/<doer>.py` that does the privileged
0172    work; register the type in `KNOWN_TYPES`. Any output the web tier will read
0173    goes under `$SWF_TMP_DIR`, not the deploy tree — see *Credential boundary*.
0174 3. **Run it in the background.** Long work goes through `run_in_background`
0175    (bounded pool, dedup, reentrant PROCESSING) so the receiver thread never
0176    blocks — see *Async execution* above.
0177 4. **Emit a completion event.** On success, publish a small event to
0178    `/topic/epictopic`; the existing SSE relay broadcasts it.
0179 5. **Push it to the browser.** The triggering page holds an `EventSource` and
0180    updates the moment the event arrives — no polling, no manual refresh. See
0181    [SSE_PUSH.md](SSE_PUSH.md).
0182
0183 The result is a button that fires a privileged action server-side under the
0184 agent's credentials and reports back live — internally, and (through the
0185 swf-remote streaming proxy) to remote collaborators. `fetch_payload_log` (GET
0186 side-effect) and `submit_task` (`/pcs/api/` POST) are the worked examples for the
0187 two trigger shapes; the campaign-provenance sweep is the next. Reach for this
0188 pattern before building a poller, a blocking handler, or anything that places a
0189 credential in the web tier.
0190
0191 ## Current capabilities
0192
0193 Verified against `agents/epicprod_ops_agent.py` and its doers, 2026-06-02:
0194
0195 | `msg_type` | Doer | Credential | Outcome | Timeout |
0196 |---|---|---|---|---|
0197 | `fetch_payload_log` | `cache-payload-log.py` | Rucio proxy + xrootd | extracted log members in `$SWF_TMP_DIR/panda-logs/<jeditaskid>/<pandaid>/`, `.done` on success / `.error` on failure | 180s |
0198 | `submit_task` | `submit-prod-task.py` | PanDA OIDC token (operator) | `jediTaskID` recorded on the `ProdTask` (`panda_task_id` + `status='submitted'`) via `record-submission` | 300s |
0199 | `rucio_snapshot_update` | `rucio-snapshot-update.py` | JLab Rucio userpass (public `eicread`) | current+last snapshot refreshed, produced datasets matched onto each task's `overrides['outputs']`; `rucio_snapshot_ready` pushed (ok true/false) | 900s |
0200 | `health_ping` | — | — | `pong` to `reply_to` | — |
0201 | `shutdown` | — | — | deliberate stop; exits `EXIT_DELIBERATE=100` so systemd leaves it down | — |
0202
0203 **`fetch_payload_log`** resolves the log DID's replica (Rucio REST, x509,
0204 account `panda`), `xrdcp`s the tarball, extracts the members into the cache, and
0205 writes a `.done` sentinel. A miss publishes the message; a hit serves from
0206 cache. On failure or timeout it writes an `.error` marker carrying the attempt
0207 count and reason, which the web view surfaces and uses to bound retries.
0208
0209 **`submit_task`** runs the same `prun` an operator runs, non-interactively: it
0210 GETs the `prun` command for the task from
0211 `/pcs/api/prod-tasks/command/?name=<name>&fmt=panda`, runs it in a clean sandbox
0212 under the panda-client environment with the cached token (never deleting
0213 `$PANDA_CONFIG_ROOT/.token`, which would force an interactive device flow),
0214 parses `jediTaskID=<N>`, and POSTs the outcome to
0215 `/pcs/api/prod-tasks/record-submission/` as the task owner (`X-Remote-User`,
0216 trusted on-host by the localhost tunnel). The task IS submitted even if the
0217 final bookkeeping POST fails; that case is surfaced loudly with the task ID so
0218 the operator can re-record.
0219
0220 The `submit_task` message is published by `services.prodtask_submit_request`,
0221 behind the REST `submit` action (the two-pane compose view, and the task-detail
0222 page's "Submit in Compose" link) — the **external-safe** trigger: a `/pcs/api/`
0223 POST returning JSON. It is gated to `status='ready'` with no existing
0224 `panda_task_id`, mirroring the `record-submission` gates so a submission whose
0225 outcome would be refused is never fired. (The legacy `prod_task_submit_panda`
0226 page-view submit — a page-POST+redirect that 502'd through the swf-remote proxy —
0227 was retired.)
0228
0229 ## Roadmap — capabilities as handlers
0230
0231 Each item below is, by design, a new handler + doer on this agent.
0232
0233 - **Async execution** (above) — implemented; the structural prerequisite for
0234   piling more long-running capabilities onto the singleton.
0235 - **Campaign-provenance sweep** — the join Sakib's catalogue and the
0236   `eic/snippets` `check_campaign.py` / `check_storage.py` do by hand, run live
0237   and credentialed: for each requested EVGEN path
0238   `/volatile/eic/EPIC/EVGEN/<suffix>`, resolve the produced
0239   `epic:/RECO/<campaign>/<detector_config>/<suffix>` DID(s), their RSE replicas
0240   (BNL-XRD / EIC-XRD), and file counts, into a provenance cache the catalog
0241   renders. This supersedes the temporary hand-curated dataset catalogue and the
0242   monthly completion-status email.
0243 - **Proactive completion notification** — auto-notify the operator the moment a
0244   fetch or submit finishes, removing the manual refresh; the corun-ai Mattermost
0245   callback is the working model. Browser-push design: [SSE_PUSH.md](SSE_PUSH.md)
0246   (the agent emits `payload_log_ready` / `prodtask_submitted` over the SSE relay).
0247 - **Credentialed MCP provider** — a future `pcs_prodtask_submit` MCP tool routes
0248   through the agent: the bot triggers, the agent (the credential holder)
0249   executes. The MCP server stays credential-free.
0250 - **OIDC service account** — submit as a robot rather than reusing the
0251   operator's token; and **EVGEN-in-Rucio registration** once that workflow is
0252   defined.
0253
0254 ## Operation
0255
0256 Running, restarting, monitoring, the systemd unit, the cleaner-killer cron
0257 (reap duplicates / liveness / prune), the deliberate-stop back doors, and the
0258 payload-log retrieval mechanics are in [EPICPROD_OPS.md](EPICPROD_OPS.md). This
0259 doc does not duplicate them.
0260
0261 **Status (2026-06-02):** deployed and live on `pandaserver02`. Handlers
0262 `fetch_payload_log`, `submit_task`, `rucio_snapshot_update`, `health_ping`, and
0263 `shutdown` are implemented and the `submit_task` path reuses the operator's
0264 cached production token. Async handler execution is implemented (a `BaseAgent`
0265 worker pool, opt-in `run_in_background`); the three work handlers enqueue their
0266 doers through it.