# Browser Push for Agent Action Completion (SSE)

When an ops-agent action finishes (a payload-log fetch, a PanDA submission, and
later the campaign-provenance sweep), the result should appear in the browser on
its own, with no manual refresh and without depending on a polling loop. This
design delivers that by reusing the existing SSE relay
([SSE_RELAY.md](SSE_RELAY.md)) with the in-app browser as a new consumer class.

The relay today is an outbound, read-only firehose of ActiveMQ workflow messages
to remote headless agents (e.g. a monitor in Japan), over WAN/firewall-friendly
HTTPS. This design adds nothing to that pipe. It adds a new *emitter* (the ops
agent, publishing a small completion event when a credentialed action succeeds)
and a new *consumer* (a browser page holding an `EventSource` that updates the
DOM the instant the event arrives). See
[EPICPROD_OPS_AGENT.md](EPICPROD_OPS_AGENT.md) for the agent and its
`run_in_background` capability.

## Completion events

On confirmed success, the agent's worker publishes to `/topic/epictopic` (the
topic the monitor's listener consumes and the relay broadcasts) using
`BaseAgent.send_message`, which is thread-safe from a worker via the send lock:

| Action | Event | Payload |
|---|---|---|
| `_do_fetch_payload_log` (after `.done`) | `payload_log_ready` | `pandaid`, `jeditaskid` |
| `_do_submit_task` (after record-submission OK) | `prodtask_submitted` | `task_name`, `jedi_task_id` |

These ride the existing workflow topic rather than a dedicated channel: zero new
relay infrastructure, and the events become a useful ops audit trail as enriched
`WorkflowMessage`s. Each event carries its identifying field (`pandaid` /
`task_name`) so a waiting page can recognize its own result.

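As a rough illustration of the table above, the two events could be assembled as plain dicts before being handed to `BaseAgent.send_message`. This is only a sketch under assumptions: the envelope keys (`msg_type`, `timestamp`), the helper name `build_completion_event`, and the call shape in the trailing comment are hypothetical and not taken from the agent code. Only the event names, payload fields, and topic come from this document.

```python
from datetime import datetime, timezone

# Topic named in this document; everything else here is illustrative.
TOPIC = "/topic/epictopic"

# Required payload fields per event, as listed in the table above.
REQUIRED_FIELDS = {
    "payload_log_ready": ("pandaid", "jeditaskid"),
    "prodtask_submitted": ("task_name", "jedi_task_id"),
}


def build_completion_event(msg_type, **fields):
    """Return a completion-event dict, refusing incomplete payloads.

    The envelope layout (msg_type + timestamp + payload fields at top
    level) is an assumption made for this sketch.
    """
    try:
        required = REQUIRED_FIELDS[msg_type]
    except KeyError:
        raise ValueError(f"unknown completion event: {msg_type!r}") from None
    missing = [name for name in required if name not in fields]
    if missing:
        raise ValueError(f"{msg_type} is missing fields: {missing}")
    event = {
        "msg_type": msg_type,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    # Copy only the documented fields so stray kwargs never reach the topic.
    event.update({name: fields[name] for name in required})
    return event


event = build_completion_event("payload_log_ready", pandaid=12345, jeditaskid=678)
# In the agent worker this might be followed by something like:
#     self.send_message(TOPIC, event)   # hypothetical call shape
```

Validating the payload before publishing keeps a malformed event from reaching a page that is waiting to match on `pandaid` or `task_name`.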
## Relay: unchanged

Listener consumes `/topic/epictopic` → enriches + persists → publishes to the
Channels group (Redis in prod) → `SSEMessageBroadcaster` → per-client queues →
`/api/messages/stream/`, filtered by `msg_type`. No change here; see
[SSE_RELAY.md](SSE_RELAY.md).

## Browser consumer: one pattern, both faces

A page that has triggered an action opens an `EventSource` filtered to the event
it awaits, e.g. `…/api/messages/stream/?msg_types=payload_log_ready`. On each
event it matches its own `pandaid` / `task_name` against the payload. Server-side
filters work by `msg_type`, not per entity, so this last-mile match is done in
JS. On a match the page loads the log or drops in the "PanDA Task N" link.

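The page performs these two steps in JS; the same logic is sketched here in Python, for instance as it might appear in a test client. The helper names `stream_url` and `is_own_event` are hypothetical, and the sketch assumes the payload fields sit at the top level of the event next to a `msg_type` key, which this document does not confirm.

```python
from urllib.parse import urlencode


def stream_url(prefix, msg_type):
    """Build the stream URL for one awaited event type."""
    query = urlencode({"msg_types": msg_type})
    return f"{prefix.rstrip('/')}/api/messages/stream/?{query}"


def is_own_event(event, msg_type, key, expected):
    """Last-mile match: right event type and right entity."""
    if event.get("msg_type") != msg_type:
        return False
    # Compare as strings: the page may hold "12345" while JSON carries 12345.
    return key in event and str(event[key]) == str(expected)


url = stream_url("/swf-monitor/", "payload_log_ready")
incoming = {"msg_type": "payload_log_ready", "pandaid": 12345, "jeditaskid": 678}
mine = is_own_event(incoming, "payload_log_ready", "pandaid", "12345")
```

Because the server filter only narrows by type, every page waiting on `payload_log_ready` sees every such event; the per-entity comparison is what keeps one user's page from reacting to another user's result.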
**Same template serves both faces.** The `EventSource` URL is written with the
monitor's own `/swf-monitor/` prefix; swf-remote's existing body rewrite turns it
into `/prod/api/messages/stream/` for devcloud automatically. Only the external
proxy *route* is new (below); the page is identical.

This replaces the compose panel's 10 s `panda_task_id` poll with the
`prodtask_submitted` event.

### Reliability backstop: required

SSE is best-effort with no replay on reconnect, and the agent can finish before
the browser's `EventSource` has connected (the event is then lost). So a page,
on load, must:

1. open the `EventSource`,
2. perform **one** immediate status check (catches an event that already fired),
3. keep a slow (~25 s) fallback poll as the correctness net.

SSE is the live, sub-second path; the immediate check and slow poll exist only so
a missed event cannot strand the user. Heartbeats (~30 s) keep the connection
alive through proxies; `EventSource` reconnects on its own.

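The three steps above can be modelled as "first path to deliver wins, exactly once". The page would implement this in JS; the sketch below shows the ordering and the idempotent hand-off in Python. The class name `CompletionWaiter` and its callbacks are hypothetical, and `check_status` stands in for whatever status endpoint the page already uses.

```python
import threading


class CompletionWaiter:
    """Resolve once, from whichever path delivers the result first."""

    def __init__(self, check_status, on_result, poll_interval=25.0):
        self._check_status = check_status    # returns a result, or None if not ready
        self._on_result = on_result          # e.g. load the log, show the task link
        self._poll_interval = poll_interval  # slow fallback period, in seconds
        self._done = threading.Event()
        self._lock = threading.Lock()

    def resolve(self, result):
        """Deliver the result exactly once; later calls are ignored."""
        with self._lock:
            if self._done.is_set():
                return False
            self._done.set()
        self._on_result(result)
        return True

    def on_sse_event(self, result):
        """Step 1: the stream handler calls this for a matching event."""
        self.resolve(result)

    def check_now(self):
        """Steps 2 and 3 share this: ask the server for the current status."""
        if not self._done.is_set():
            result = self._check_status()
            if result is not None:
                self.resolve(result)
        return self._done.is_set()

    def run_fallback_poll(self):
        """Step 3: poll slowly until some path has resolved."""
        while not self._done.wait(self._poll_interval):
            self.check_now()
```

A caller would open the stream first and route matching events to `on_sse_event`, then call `check_now()` once, then start `run_fallback_poll` in a background thread. Opening the stream before the immediate check is what closes the window in which an event could fire unseen.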
## External face: swf-remote streaming proxy (new infrastructure)

`pandaserver02` is inside the BNL perimeter and unreachable by a remote browser,
so the devcloud face goes through the swf-remote proxy on ec2dev. The browser's
`EventSource` is therefore **same-origin to epic-devcloud.org**, so no browser
CORS is involved. The cross-network hop is swf-remote → monitor over the SSH
tunnel.

The existing `monitor_client.proxy()` cannot carry SSE: it reads the full
response body (`httpx.get`, 30 s timeout) and byte-rewrites it, neither of which
can work on a never-ending `text/event-stream`. A **dedicated streaming view**
is required:

- `httpx.stream('GET', f'{base}/api/messages/stream/', params=…, headers=…, timeout=None)`
  → `StreamingHttpResponse(content_type='text/event-stream')`, yielding chunks
  with **no buffering, no body rewrite, no timeout cap**.
- Route: `/prod/api/messages/stream/`.
- Devcloud's Apache must not buffer this response (to be verified at build time).

This view is the only new piece for the external face, and it is deployed on
ec2dev (swf-remote is solo-maintained, direct-to-main).

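A minimal sketch of such a streaming view follows, assuming `httpx` and Django are installed. It is not the code in `monitor_client.py`: the names `MONITOR_BASE`, `SERVICE_TOKEN`, `upstream_request`, `iter_upstream`, and `message_stream_proxy` are hypothetical, the base URL is a placeholder, and login gating is omitted. The third-party imports sit inside the functions only so the pure helper can be exercised without them.

```python
# Hypothetical configuration; a real deployment would read these from settings.
MONITOR_BASE = "https://localhost:8443/swf-monitor"  # placeholder tunnel endpoint
SERVICE_TOKEN = "replace-me"                         # placeholder service token


def upstream_request(base, query_items, token):
    """Return (url, params, headers) for the upstream SSE request."""
    url = f"{base.rstrip('/')}/api/messages/stream/"
    headers = {
        "Authorization": f"Token {token}",  # service token for the proxy hop
        "Accept": "text/event-stream",
    }
    return url, dict(query_items), headers


def iter_upstream(url, params, headers):
    """Yield upstream bytes as they arrive, with no rewrite and no timeout cap."""
    import httpx  # third-party

    with httpx.stream("GET", url, params=params, headers=headers,
                      timeout=None) as upstream:
        upstream.raise_for_status()
        # The connection stays open for as long as the caller keeps iterating.
        yield from upstream.iter_bytes()


def message_stream_proxy(request):
    """Django view sketch for /prod/api/messages/stream/ (auth gating omitted)."""
    from django.http import StreamingHttpResponse  # third-party

    url, params, headers = upstream_request(
        MONITOR_BASE, request.GET.items(), SERVICE_TOKEN
    )
    response = StreamingHttpResponse(
        iter_upstream(url, params, headers),
        content_type="text/event-stream",
    )
    response["Cache-Control"] = "no-cache"  # discourage intermediary caching
    return response
```

Keeping the `httpx.stream` context inside the generator means the upstream connection is released when Django stops iterating, for example after the browser disconnects. A synchronous streaming view of this kind may hold one worker per connected browser, which is worth checking against the deployment's worker count.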
## Authentication: no browser CORS anywhere

- **Internal browser → monitor:** session (CILogon). The SSE endpoint already
  accepts session auth.
- **devcloud browser → swf-remote:** Django session, same-origin, gated by login.
- **swf-remote → monitor (the stream hop):** a **service `Token`** on the
  upstream request. The SSE endpoint already honors `Authorization: Token`; it
  does not currently honor `X-Remote-User`, so the service token is the path of
  least change. The devcloud user is still gated by swf-remote's login; the
  token authenticates only the trusted proxy hop.

## View copy

While an action is in flight, the triggering view renders "log / task ID will
appear shortly" instead of asking the user to refresh.

## Verification order

1. **Internal pipe:** publish a `payload_log_ready` to `/topic/epictopic`,
   confirm a token SSE subscriber receives it (shell round-trip).
2. **Internal browser:** a logged-in monitor page receives the event and updates
   live.
3. **External browser:** through the new swf-remote streaming proxy, confirm
   prompt delivery (no buffering) and that the connection survives past 30 s
   via heartbeats.

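For step 1, a token subscriber needs little more than an SSE line parser. The sketch below follows the standard `text/event-stream` framing (fields separated by `:`, events terminated by a blank line, lines starting with `:` treated as comments). It assumes each event's `data` is a JSON object carrying a `msg_type` key, and that heartbeats arrive either as comments or as events of another type; neither detail is confirmed by this document. The function names are hypothetical.

```python
import json


def parse_sse(lines):
    """Yield (event_name, data) pairs from an iterable of decoded SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event, if any data was collected.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a heartbeat
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # one leading space after the colon is dropped
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


def first_event_of_type(lines, msg_type):
    """Return the first JSON event whose msg_type matches, else None."""
    for _name, data in parse_sse(lines):
        try:
            msg = json.loads(data)
        except ValueError:
            continue  # not JSON; ignore
        if isinstance(msg, dict) and msg.get("msg_type") == msg_type:
            return msg
    return None
```

Against a live endpoint, the lines could come from `response.iter_lines()` inside an `httpx.stream("GET", ...)` block that sends the `Authorization: Token` header; publishing a `payload_log_ready` from another shell should then make `first_event_of_type` return.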
## Build order

1. This design doc.
2. Shared substrate (swf-testbed): agent emits the two completion events; relay
   already broadcasts. Internal page gets the `EventSource` + backstop + the view
   copy change.
3. External face (ec2dev): the swf-remote streaming proxy view + route, Apache
   no-buffer. Authored in the swf-remote clone here, deployed on ec2dev.

The substrate is shared; the devcloud delta is just the streaming proxy.

**Status:** implemented. The prod-ops agent publishes the completion events
(`agents/epicprod_ops_agent.py`: `payload_log_ready`, `prodtask_submitted`); the
internal browser pages hold the `EventSource` consumers
(`src/monitor_app/viewdir/pandamon.py`, `src/pcs/templates/pcs/prod_task_compose.html`);
and the swf-remote streaming proxy relays the stream on the external face
(`../swf-remote/src/remote_app/monitor_client.py`, `StreamingHttpResponse`).