# Release Notes

## v34 (2026-04-21)

### Streaming MCP Moved Off mod_wsgi (swf-monitor)

The `/swf-monitor/mcp/` endpoint now runs on a dedicated ASGI worker (uvicorn, `swf-monitor-mcp-asgi.service` on `127.0.0.1:8001`) behind Apache `ProxyPass`. Everything else (`/about/`, `/api/`, `/accounts/login/`, PCS, static files) stays on mod_wsgi.

**Why:** `django-mcp-server` uses `StreamableHTTPSessionManager` from the MCP Python SDK, which is an ASGI component. Under WSGI, each streaming MCP session holds a thread via `async_to_sync` for the full session lifetime. A handful of concurrent MCP clients (OpenCode, Claude Code CLI, Ollama-backed scripts, python-httpx, or any other streamable-HTTP MCP client) was enough to saturate the thread pool and return 503 for every dynamic URL on the site. Isolating `/mcp/` on an async worker removes that failure mode from the main app.
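The failure mode can be reproduced in miniature. The sketch below is not the monitor's code; it uses a plain `ThreadPoolExecutor` as a stand-in for the mod_wsgi thread pool (the pool size of 2 and the function name are assumptions chosen for the demo) to show how long-lived sessions starve ordinary requests.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError

POOL_SIZE = 2  # stand-in for the WSGI thread count, kept tiny for the demo


def demo_saturation():
    """Show that sessions pinning every worker block an ordinary request."""
    release = threading.Event()
    pool = ThreadPoolExecutor(max_workers=POOL_SIZE)
    # Each "streaming session" holds one worker until the session ends.
    for _ in range(POOL_SIZE):
        pool.submit(release.wait)
    # A short page request now has no free worker to run on.
    page = pool.submit(lambda: "200 OK")
    try:
        page.result(timeout=0.2)
        blocked = False
    except TimeoutError:
        blocked = True
    release.set()  # sessions end, workers are freed
    status = page.result(timeout=5)
    pool.shutdown()
    return blocked, status
```

With every worker pinned, the page request only completes once a session ends, which is the behavior that moving `/mcp/` to an async worker avoids.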

**What changed operationally:**

- mod_wsgi tuned for burst resilience: `threads=30`, `listen-backlog=500`, `queue-timeout=30`, `inactivity-timeout=300`, `graceful-timeout=15`. No `request-timeout` is set because it would truncate the `/api/messages/stream/` SSE long-poll.
- Proxy tuned for streaming: `timeout=3600 keepalive=On disablereuse=On`, `proxy-sendchunked`, `no-gzip`, `CacheDisable` on `/mcp/`.
- `swf-monitor-mcp-asgi.service` systemd unit added (`Restart=always`, 2 uvicorn workers).
- `src/swf_monitor_project/asgi.py` cleaned up: removed the dead `mcp_app.routing` import (the module was replaced by the `mcp_server` package long ago, so the ASGI entrypoint was quietly broken).

### Apache Config Auto-Sync on Deploy (swf-monitor)

`apache-swf-monitor.conf` in the repo is now the source of truth. `deploy-swf-monitor.sh` diffs it against the live `/etc/httpd/conf.d/swf-monitor.conf` on every deploy; if they differ, it backs up the live file, installs the version from the release, validates with `httpd -t`, and rolls back on failure. The Apache reload that happens on every deploy (to recycle mod_wsgi for new Python code) picks up any conf change along with it.
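The compare, back up, install, validate, roll back sequence can be sketched in Python. This is an illustration only: the real logic lives in the shell script `deploy-swf-monitor.sh`, and the function names, return strings, and `.bak` backup name below are assumptions.

```python
import shutil
import subprocess
from pathlib import Path


def httpd_configtest() -> bool:
    """Return True when `httpd -t` accepts the installed configuration."""
    return subprocess.run(["httpd", "-t"], capture_output=True).returncode == 0


def sync_conf(repo_conf: Path, live_conf: Path, validate=httpd_configtest) -> str:
    """Install repo_conf over live_conf, restoring the backup if validation fails."""
    new_bytes = repo_conf.read_bytes()
    had_live = live_conf.exists()
    if had_live and live_conf.read_bytes() == new_bytes:
        return "unchanged"
    backup = live_conf.with_name(live_conf.name + ".bak")  # hypothetical backup name
    if had_live:
        shutil.copy2(live_conf, backup)
    shutil.copy2(repo_conf, live_conf)
    if validate():
        return "installed"
    # Validation failed: put the previous file back (or remove the new one).
    if had_live:
        shutil.copy2(backup, live_conf)
    else:
        live_conf.unlink()
    return "rolled_back"
```

Passing the validator as a parameter keeps the sketch testable without Apache installed; a deploy would use the default, which shells out to `httpd -t`.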

**Why it matters:** the live config had drifted for 6 weeks. The Mar 11 `dce7abf` fix for MCP IP restriction was committed to the repo but never reached live Apache because nothing copied it. `setup-apache-deployment.sh` regenerated the conf from a hardcoded heredoc (which had itself drifted from the canonical repo copy), and `deploy-swf-monitor.sh` didn't touch the Apache conf at all. This gap is now closed: the setup script `cp`s `apache-swf-monitor.conf` and splits the dynamic `LoadModule` line out to `/etc/httpd/conf.modules.d/20-swf-monitor-wsgi.conf`.

**The ASGI worker is also recycled on every deploy.** uvicorn loads code once at startup, so fresh Python code requires a restart. Bots already follow the same pattern (conditional on bot-specific code changes).

### PanDA Mattermost Bot: Multi-Server MCP with Progressive Tool Loading (swf-monitor)

The PanDA bot now orchestrates across **seven external MCP servers** plus the local swf-monitor MCP, selecting tools based on the user's question. New integrations:

- **LXR MCP server** (`github.com/BNLNPPS/lxr-mcp-server`, new this release): EIC code browser cross-reference. `lxr_ident` (definitions + references), `lxr_search` (ripgrep across repos), `lxr_source` (read source with line numbers), `lxr_list` (browse directories).
- **uproot MCP server** (`github.com/eic/uproot-mcp-server`): inspect ROOT files (list branches, read arrays, sample contents).
- **JLab-Rucio and BNL-Rucio MCP servers**: query Rucio for EIC datasets, replicas, and rules.
- **GitHub MCP server**: now uses the `epic-capybara` service account with write access for bot-driven automation on EIC repos.
- **epicdoc**: RAG search over ePIC documentation (`epic_doc_search`, `epic_doc_contents`). Runs in-process inside the bot, not as a separate MCP server and not inside WSGI. An initial attempt to host it in WSGI brought the monitor down, so it was moved (see the debugging notes in the 2026-03-31 assessment).

With that many tools, "send the whole catalog to the LLM every turn" stops working. Two new techniques address that:

- **Progressive tool loading via semantic similarity.** For each user question the bot embeds the question and ranks tools by server-prefixed cosine similarity, auto-truncating at a score cliff. The LLM sees a small, relevant slice rather than the full catalog of hundreds of tools, and the rank order is preserved in the displayed list so the LLM can judge relevance.
- **3-tier tool awareness.** Every tool is visible by name + one-line catalog entry in the system prompt, so the LLM knows the full surface area exists at minimal token cost. Detailed schemas are fetched only for tools the LLM explicitly selects via `select_tools`. Server and suggestion context carries forward across thread turns, so follow-ups don't re-select from scratch.
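A minimal sketch of similarity ranking with cliff truncation follows. It is not the bot's implementation: the toy two-dimensional vectors stand in for real embeddings, and defining the "cliff" as the largest drop between neighbouring scores is an assumption made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_tools(question_vec, tool_vecs):
    """Return (tool_name, score) pairs sorted by descending similarity."""
    scored = [(name, cosine(question_vec, vec)) for name, vec in tool_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def truncate_at_cliff(ranked, min_keep=1):
    """Keep the ranked tools that come before the largest score drop."""
    min_keep = max(min_keep, 1)
    if len(ranked) <= min_keep:
        return ranked
    drops = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = max(range(min_keep - 1, len(drops)), key=lambda i: drops[i])
    return ranked[: cut + 1]


# Toy embeddings (assumed values): two code-search tools point the same way,
# the documentation tool points elsewhere.
tools = {
    "lxr_search": [1.0, 0.0],
    "lxr_ident": [0.9, 0.1],
    "epic_doc_search": [0.0, 1.0],
}
selected = truncate_at_cliff(rank_tools([1.0, 0.05], tools))
```

Here `selected` keeps the two `lxr_*` tools in rank order and drops `epic_doc_search`, because the score falls sharply between the second and third entries.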

**Other bot improvements:**

- **System prompt externalized** to `monitor_app/panda/system_prompt.txt` and re-read on every message, so prompt iteration no longer requires a bot restart.
- **DPID detection hardened.** For job/task questions the bot verifies that any Data Provenance ID in the reply came from actual tool output before letting it through. Detection is now line-based and format-agnostic; both a trigger word **and** a matching ID must be present.
- **Bamboo log analysis** integrated into `panda_study_job` for failed jobs: it surfaces Harvester pilot-log analysis automatically when the filebrowser lookup fails. The result is exposed to the LLM via an explicit `log_analysis` field that the bot is instructed to surface.
- **Response style rules** in the system prompt curb overenthusiastic replies (e.g., verbose explanations when a one-line answer suffices).
- Server-side matplotlib plot rendering, and nightly cron scripts to auto-update each MCP server repo.
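The DPID check described in the list above can be illustrated with a small sketch. The function name, the `"DPID"` trigger word, and the idea of matching against a set of IDs collected from tool output are assumptions; the bot's actual rules are not shown in these notes.

```python
def reply_cites_real_dpid(reply: str, issued_ids: set, trigger: str = "DPID") -> bool:
    """True when one line of the reply holds both the trigger word and an issued ID."""
    for line in reply.splitlines():
        if trigger.lower() not in line.lower():
            continue  # no trigger word on this line
        # Format-agnostic: match known IDs as substrings instead of parsing a format.
        if any(dpid in line for dpid in issued_ids):
            return True
    return False
```

Matching against IDs that tools actually returned, rather than against a pattern, means a plausible-looking but invented ID never passes the check.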

### New swf-monitor MCP Tool: `panda_harvester_workers`

Live Harvester pilot/worker counts via bamboo's `askpanda_atlas`. Useful for "what pilots are running right now?" without needing to grep through Harvester logs.

```python
panda_harvester_workers(status='running', site='NERSC', resourcetype='SCORE', days=1)
```

Returns totals plus a breakdown by status, site, and resourcetype, in a compact, LLM-friendly response format.

### PCS: Compose UX Polish + Programmatic Submission Path (swf-monitor)

**Compose pages (Physics/EvGen/Simu/Reco tags, Datasets, Prod Configs, Prod Tasks):**

- Uniform button styling: all filled (solid) variants, dark-green accent on live edited values, consistent New-button placement in the left panel across all compose views.
- Breadcrumbs and Cancel buttons point to compose views instead of the legacy list views.
- Name-based URL params so compose views are bookmarkable and deep-linkable.
- Owner-only edit enforcement on production configs (same discipline as tag edits).
- Edit / Copy / New buttons no longer silently fail on prod config compose (previous type-argument mismatch fixed).
- Compose panels for `command` and `taskParamMap` grow to fit content instead of forcing horizontal scroll.
- Fixed type-argument mismatch in compose URL sync.

**Production Tasks (submission artifacts):**

A single read-only endpoint regenerates a task's submission artifact from current PCS state on every call (no DB writes):

```
GET /swf-monitor/pcs/api/prod-tasks/command/?name=<task_name>&fmt=<format>
```

| `fmt` | Contents |
|-------|----------|
| `condor` | env-prefixed `submit_csv.sh` command |
| `panda` | `prun` command |
| `jedi` | `taskParamMap` for `Client.insertTaskParams()` |
| `dump` | Full view: task + dataset + all four tags + prod config + effective config |

The parameter is `fmt` because DRF reserves `format` for its own content-negotiation.
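As a rough illustration of how a client might build this request with the standard library, here is a sketch. The host `https://monitor.example.org` and the function name are hypothetical, and how the path combines with a given deployment's base URL is an assumption.

```python
from urllib.parse import urlencode

VALID_FORMATS = {"condor", "panda", "jedi", "dump"}
COMMAND_PATH = "/swf-monitor/pcs/api/prod-tasks/command/"


def command_url(host: str, task_name: str, fmt: str) -> str:
    """Build the artifact URL for one task, rejecting unknown fmt values early."""
    if fmt not in VALID_FORMATS:
        raise ValueError(f"fmt must be one of {sorted(VALID_FORMATS)}, got {fmt!r}")
    query = urlencode({"name": task_name, "fmt": fmt})  # escapes spaces and symbols
    return f"{host.rstrip('/')}{COMMAND_PATH}?{query}"


url = command_url("https://monitor.example.org/", "my task", "jedi")
```

The resulting URL can be fetched with `urllib.request.urlopen`; any authentication header a non-public deployment needs is left out here because its scheme is not described in these notes.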

**New CLI `pcs-task-cmd`**: a stdlib-only Python client for that endpoint. It is the recommended way for production operators and automation to fetch submission artifacts (no Django import, no DB credentials):

```bash
# Inspect a task
pcs-task-cmd <task_name> --format dump

# Submit to JEDI (requires valid PanDA auth)
pcs-task-cmd <task_name> --format jedi | python -c '
import json, sys
from pandaclient import Client
print(Client.insertTaskParams(json.load(sys.stdin)))
'

# Run the Condor command in the current shell
eval "$(pcs-task-cmd <task_name> --format condor)"
```

Environment: `SWFMON_URL` (default `https://epic-devcloud.org/prod`), optional `SWFMON_TOKEN` for non-public deployments.

**JEDI taskParamMap now surfaced on task detail:** `build_task_params()` renders the full param map users will submit, viewable and copyable directly from the compose page.

### Deploy-Script Improvements (swf-monitor)

- **`swf-monitor-mcp-asgi.service` restart step**: always restarts on deploy (uvicorn needs it).
- **Apache conf sync**: described above.
- **Shared HuggingFace cache**: `deploy-swf-monitor.sh` ensures `/opt/swf-monitor/shared/hf_cache` exists with open perms and appends `HF_HOME=` to `production.env` if missing. Bamboo and epicdoc reuse the cache across processes.
- **Bot restarts after health check, not before**: avoids killing bots mid-request if Apache comes up broken.
- **Nightly cron** (`nightly-update-mcp-servers.sh`, `nightly-update-epicdoc.sh`): auto-updates sibling MCP-server repos and re-ingests ePIC documentation into epicdoc's ChromaDB store.

### PanDA Production Monitoring: Job Deep-Dive Enhancements (swf-monitor)

- **NERSC portal log URLs** surfaced for Perlmutter jobs in `panda_study_job`: clickable links to the NERSC job portal alongside existing Harvester log URLs.
- **Bamboo log analysis** runs on failed jobs automatically; LLM-friendly `log_analysis` field with fallback to Harvester URL when filebrowser fails.
- **Error field rename** in `/panda job` output (source → component), which fixes a `KeyError` that surfaced on some job records.

### Auth & API Changes (swf-monitor)

- **`TunnelAuthMiddleware`** now requires an `X-Remote-User` header before auto-authenticating, so anonymous proxy requests no longer get a free pass. This matches the threat model of the `TunnelAuthentication` DRF backend, which also checks the header before acting.
- **`/api/users/`** response now includes `email`, `first_name`, `last_name`, which enables richer devcloud account sync.
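The header requirement can be expressed as a small, framework-free decision function. This is a sketch under assumptions, not the middleware itself: the function name and the set of addresses treated as tunnel peers are hypothetical.

```python
from typing import Optional

LOCAL_ADDRS = {"127.0.0.1", "::1"}  # assumed definition of a tunnel/localhost peer


def tunnel_username(remote_addr: str, headers: dict) -> Optional[str]:
    """Return the user to auto-authenticate, or None to leave the request anonymous."""
    if remote_addr not in LOCAL_ADDRS:
        return None  # not a tunnel request, normal authentication applies
    user = headers.get("X-Remote-User", "").strip()
    return user or None  # a missing or empty header means no auto-authentication
```

A proxied request without `X-Remote-User` therefore stays anonymous instead of inheriting a default identity.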

### Documentation

- **`PRODUCTION_DEPLOYMENT.md`** refreshed for the two-backend layout, the new `setup-apache-deployment.sh` behavior, and the full deploy step list (conf sync, ASGI worker restart).
- **`MCP.md`**: ASGI/WSGI split documented, transport description corrected (it is streamable HTTP), tool summary count corrected to 44, all tool categories added.
- **`PCS.md`**: MCP Tools table corrected to the tools that actually exist.
- **JEDI design docs** added: `JEDI_INTEGRATION.md` (architecture, field mapping, implementation plan) and `JEDI_EPIC_PROPOSAL.md` (technical proposal for PanDA team review). Together they form the roadmap for direct task submission to JEDI, replacing the current `prun` CLI text generation.

### Agent Resilience (swf-common-lib)

Further hardening of the BaseAgent lifecycle under unreliable infrastructure:

- **Agent-ID registration retries indefinitely** on API failure (previously it gave up after a bounded number of attempts). Agents starting into a partially-up monitor no longer silently fail to register.
- **Improved resilience to server restarts**: agents survive transient monitor outages and resume their heartbeat loop cleanly on reconnection.
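An unbounded retry loop of this kind can be sketched as follows. The function name, delays, and the use of capped exponential backoff between attempts are assumptions for illustration, not the BaseAgent code.

```python
import time


def register_with_retry(register, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call register() until it succeeds, waiting longer (up to max_delay) each time."""
    delay = base_delay
    while True:
        try:
            return register()
        except Exception:
            # A real agent would log the failure and catch narrower exceptions.
            sleep(delay)
            delay = min(delay * 2, max_delay)  # capped exponential backoff
```

Capping the delay keeps an agent that started during a long outage from waiting arbitrarily long once the monitor returns. The `sleep` parameter exists so the loop can be exercised without real waiting.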

### swf-testbed: Upstream Contributions Integrated

Several contributions landed direct-to-main during and just before the v34 cycle that were not acknowledged in earlier release notes. They are part of main as of this release. With thanks:

**Agent code consolidation: Dmitry Kalinkin (PR #35, #36)**

Unified agent code into the `swf-testbed` repository:

- **PR #35 "Import SOTA agents"**: imports `agents/data_agent.py` and `agents/processing_agent.py` with full git history from the sibling repositories `BNLNPPS/swf-data-agent` and `BNLNPPS/swf-processing-agent`. Supersedes the shell of earlier example agents with BaseAgent-derived implementations (Rucio / XRootD integration, MQ handlers, dataset lifecycle).
- **PR #36 "Delete superseded agents"**: final cleanup once the unified `agents/` package stabilized. Removes `example_agents/daq_simulator_superseded.py`, `example_agents/example_daqsim_agent_superseded.py`, and `example_agents/processing_agent.py`.

**Prompt-processing workflow: Zhaoyu Yang (PR #37, #38)**

A new streaming workflow for prompt processing of time-frame slices, built on top of Dmitry's imported agents package:

- `agents/prompt_processing_agent.py`: new agent for the prompt-processing pipeline
- `workflows/prompt_processing.py`, `workflows/prompt_processing.toml`, `workflows/prompt_processing_default.toml`: workflow definition and default config
- Orchestrator wiring in `workflows/orchestrator.py`; supervisord entry in `agents.supervisord.conf`
- `scripts/dummy_stf_processing.sh`: placeholder payload for development
- Refactor updates to `agents/data_agent.py` supporting the new flow
- Documentation: `docs/prompt-processing-workflow.md`, architecture image `docs/images/prompt-processing-workflow.png`, `docs/skills-for-testbed.md`

**CRIC endpoint / queue-config expansions: Xin Zhao (PR #34)**

- `config/ddm_endpoints.json`: substantial DDM endpoint additions (+465 lines)
- `config/panda_queues.json`: PanDA queue config additions (+1030 lines)
- Reflects updated CRIC-sourced site/endpoint data for ePIC production

### swf-testbed: Baseline Branch Work

No user-facing changes on the `infra/baseline-v34` branch itself, only administrative commits (CLAUDE.md branch-reference updates, v33 release notes catch-up, v34 release notes including this acknowledgments section).

---

## v33 (2026-03-29)

### Dual-Mode UI: ePIC Production / ePIC Testbed (swf-monitor)

The monitor now operates in two modes, selectable via a nav bar toggle (localStorage-persisted):

- **ePIC Production** (`/prod/`): PanDA production monitoring (activity, jobs, tasks, errors, diagnostics, queues) + PCS (tags, datasets, prod configs, prod tasks). Shared PCS sections template keeps PCS hub and production hub in sync.
- **ePIC Testbed** (`/testbed/`): Streaming workflow testbed (workflows, time frame data, agents, messaging, system state, PanDA/Rucio).

Root URL redirects based on mode. About page updated for dual-mode, all access methods, tech stack.

### PanDA Production Pages (swf-monitor)

Full DataTables views for **Activity, Jobs, Tasks, Errors, Diagnostics**. **EIC PanDA Queues** from live schedconfig with MCP tools (`panda_list_queues`, `panda_get_queue`). **`panda_resource_usage`** for allocated vs used core-hours. **`panda_study_job`** for deep single-job analysis. **`destinationse`** (destination storage element) from filestable4 added to job listings and error summary. PanDA query modules refactored into `constants.py`, `sql.py`, `queries.py`. Monitor links point to epic-devcloud.org.

### PCS Auth & Proxy Support (swf-monitor)

Full PCS functionality through the swf-remote (epic-devcloud.org) proxy:

- **`TunnelAuthentication`** DRF backend: authenticates localhost/tunnel requests via `X-Remote-User` header without CSRF enforcement
- **`IsAuthenticatedOrReadOnly`** on all PCS API viewsets: anonymous GET, auth required for writes
- **`created_by` from `request.user`**: read-only in serializers, set server-side
- **Tag delete API**: `POST /delete/` with creator-only, draft-only enforcement
- **All PCS templates** converted from form POST to JS fetch → REST API
- **`/api/users/`** endpoint with password hash for devcloud account sync

### Mattermost PanDA Bot (swf-monitor)

- **4 MCP server types**: HTTP (PanDA, PCS), stdio (XRootD, GitHub, Zenodo)
- **DPID (Data Provenance ID)** anti-fabrication: bot verifies LLM cited a real DPID, strips from user reply, warns if verification fails
- **`/panda` slash commands**: status, errors, jobs/tasks with status filter and pagination, job/task detail, sites, site detail, help
- **`bot_manage_servers`** virtual tool: list with versions, update/rebuild/restart
- **Server-side matplotlib plots** in Mattermost
- System prompt: data integrity rules, security rules, "never ask user to look something up"

### MCP Servers

- **Zenodo** (`eic/zenodo-mcp-server`): search, inspect, download from zenodo.org
- **XRootD** (`eic/xrootd-mcp-server`): file browsing and reading on JLab XRootD
- **GitHub** (`github/github-mcp-server`): read-only repo, issue, PR, actions access
- **StdioMCPClient** transport for managing external MCP server subprocesses

### Agent Resilience (swf-common-lib, swf-testbed)

- API retry with exponential backoff (swf-common-lib)
- Agent manager: supervisord health verification, SIGUSR1 heartbeat, exit heartbeat on shutdown
- check-testbed skill and supervisord health monitoring
- AI memory hooks for cross-session dialogue persistence

### Bug Fixes

- Namespace datatable: `Count('id')` on model without `id` field
- `list_tasks`: stale filter params misaligned with where clauses
- Django 5+ logout requires POST
- Workflow parameter override: auto-discover all config sections

## v32 (2026-03-02)

### PCS (Physics Configuration System): New Django App (swf-monitor)

A new Django app for configuring production tasks based on physics inputs for ePIC Monte Carlo simulation campaigns. PCS organizes configurations as tags, which are named parameter sets for each stage of the MC pipeline:

- **Physics tags (p):** process, beam energies, species, Q2 range
- **EvGen tags (e):** event generator and version
- **Simu tags (s):** detector simulation config
- **Reco tags (r):** reconstruction config

Tags have a draft/locked lifecycle. Locked tags are immutable and used in production.

**Tag compose UI:** Split-panel interface for browsing, creating, editing, copying, and locking tags. Arrow key navigation, parameter filter dropdowns, inline editing with suggestion bars, predicted tag numbering, and diff highlighting for edits. Generalized for all four tag types with category-conditional fields.

**Seeded data:** `seed_campaign_tags` management command creates 64 tags from the 26.02.0 campaign (47 physics, 15 evgen, 1 simu, 1 reco).

**MCP tools:** `pcs_list_tags`, `pcs_get_tag`, `pcs_search_tags`.

### PanDA Mattermost Bot (swf-monitor)

Claude-based production monitoring chatbot in Mattermost. Listens in the `#pandabot` channel, answers questions using Claude Haiku with tool use.

- Discovers tools from MCP server automatically
- System prompt built from MCP server instructions, stays in sync with deployed tool documentation
- Supports PanDA and PCS tools
- Thread-aware conversations

### PanDA Web Monitor (swf-monitor)

New web views for ePIC-focused PanDA production monitoring:

- Activity overview, job list, task list, job detail, task detail, error summary, job diagnostics
- Cross-linking, days selector, server-side DataTables, colored status badges
- Shares data layer with MCP tools via factored `panda/` package (`constants.py`, `sql.py`, `queries.py`)

### PanDA MCP Tools: New and Enhanced (swf-monitor)

Six new tools for PanDA production monitoring via MCP:

- `panda_list_jobs`: job overview with summary stats, cursor-based pagination
- `panda_list_tasks`: JEDI task monitoring with workinggroup/processingtype filters
- `panda_get_activity`: pre-digested activity overview (aggregate counts, no individual records)
- `panda_error_summary`: aggregate error ranking across failed jobs
- `panda_diagnose_jobs`: failed job diagnostics with all 7 error component fields
- `panda_study_job`: deep single-job analysis (~40 fields, filestable, condor logs, structured errors)

### MCP Infrastructure (swf-monitor)

- Refactored monolithic `mcp.py` (2,544 lines) into `mcp/` package
- AI memory model and REST API for cross-session dialogue persistence
- Fixed `_get_username()`: use SWF_HOME directory ownership instead of `getpass.getuser()` (returns 'apache' under WSGI)
- Fixed fastmon-files API to accept STF filename string instead of requiring UUID
- Added Bootstrap 5 CSS

### Documentation Cleanup

Deleted 9 stale or superseded files across both repos (1,800+ lines removed): old monolithic README backup, abandoned design docs, failed procedure docs, one-time reports, broken index pages. Fixed hardcoded credentials in installation guide, dead links, malformed markdown, and updated CLAUDE.md branch reference to v32.

### swf-common-lib

No changes in v32.

---

## v31 (2026-02-18)

### Robustness Improvements for LLM-driven Testbed Controls

Hardened the MCP control path so AI agents can reliably start, monitor, and manage testbed workflows without misinterpreting system state.

**Testbed status fixes (swf-monitor):**
- `ready` field now checks running workflow executions, not agent count; it was permanently false when agents were idle after a completed workflow
- REST heartbeat no longer overwrites `workflow_enabled` to false on every heartbeat
- `start_workflow` namespace resolution falls back to running agent manager's namespace when env var unavailable in Apache context
- `start_user_testbed` no longer destroys the agent manager on every start
- Surfaced supervisord health and agent manager errors in MCP status tools
- Fixed MCP username resolution: use SWF_HOME directory ownership, require explicit username parameter

**Agent manager hardening (swf-testbed):**
- Verify supervisord health, check agent starts, log errors instead of failing silently
- SIGUSR1 heartbeat refresh after check-testbed fixes
- Exit heartbeat on shutdown so DB immediately reflects agent manager death
- check-testbed skill for bootstrapping infrastructure
- Fixed workflow parameter override to auto-discover all config sections

**Workflow monitoring guidance:**
- MCP docs now instruct AI to actively poll `swf_get_workflow_monitor` during execution rather than sleeping

**Other:**
- AI memory hooks and documentation for cross-session dialogue persistence
- Refactored monolithic mcp.py into package (system, workflows, ai_memory, common)

## v30 (2026-02-03)

### Auth0 OAuth 2.1 Authentication for Claude.ai MCP

Added secure OAuth 2.1 authentication for remote MCP connections from Claude.ai, using [Auth0](https://auth0.com/) as the identity provider.

**How it works:**
1. Claude.ai discovers OAuth metadata via `/.well-known/oauth-protected-resource`
2. User authenticates with Auth0 (redirected to Auth0's login page)
3. Auth0 issues JWT access token to Claude.ai
4. Claude.ai includes Bearer token in MCP requests
5. Django middleware validates JWT against Auth0's JWKS endpoint

**Configuration:**
```bash
AUTH0_DOMAIN=your-tenant.us.auth0.com
AUTH0_CLIENT_ID=your-client-id
AUTH0_CLIENT_SECRET=your-client-secret
AUTH0_API_IDENTIFIER=https://your-server/swf-monitor/mcp
```

**Access modes:**
- **Claude.ai (remote)**: Requires OAuth authentication via Auth0
- **Claude Code (local)**: POST requests pass through without auth for local development

**Network requirement:** Claude.ai connects from Anthropic's servers, so the MCP endpoint must be accessible from the public internet.

### MCP Tool Naming Convention

Renamed all 29 MCP tools with `swf_` service prefix for multi-server discovery:
- `list_agents` → `swf_list_agents`
- `get_system_state` → `swf_get_system_state`
- etc.

This follows MCP best practices for environments where multiple MCP servers are connected. The prefix enables clean tool discovery and avoids naming collisions.

Reference: https://www.philschmid.de/mcp-best-practices

### Pagination Metadata for List Tools

All list tools now return pagination metadata to help LLMs manage context:

```json
{
  "items": [...],
  "total_count": 1523,
  "has_more": true,
  "monitor_urls": [...]
}
```

- `total_count`: Total matching records in database
- `has_more`: Boolean indicating results are truncated

This helps LLMs understand when query results are incomplete and whether to refine filters.
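A minimal sketch of how a list tool could derive these two fields is shown below. The function name and the `limit` parameter are assumptions, and `monitor_urls` is omitted for brevity.

```python
def paginate(records: list, limit: int) -> dict:
    """Return the first `limit` records plus the pagination metadata fields."""
    items = records[:limit]
    return {
        "items": items,
        "total_count": len(records),             # all matching records
        "has_more": len(records) > len(items),   # True when the result was truncated
    }


page = paginate(list(range(5)), limit=2)
```

A caller that sees `has_more` set alongside a large `total_count` knows to narrow its filters rather than treat the returned items as the complete answer.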

### New MCP Tool: swf_send_message

Send messages to the workflow monitoring stream:

```python
swf_send_message(
    message="Test message",
    message_type="announcement",  # or "test", custom types
    metadata={"key": "value"}     # optional
)
```

Use cases:
- Testing the message pipeline end-to-end
- Sending announcements to colleagues monitoring the stream
- Debugging SSE relay functionality

### Message Type Standardization

Standardized on `stf_ready` message type across all agents. Previously some agents used `data_ready` inconsistently. Updated `WORKFLOW_MESSAGE_TYPES` in swf-common-lib and all example agents. Also added `tf_file_registered` to the canonical message types.

### Bug Fixes

- **Fixed monitor URLs in MCP responses**: Tool responses were returning localhost URLs instead of production URLs. Now correctly returns URLs based on deployment configuration.

### Documentation

- Updated `docs/MCP.md` with all swf_ prefixed tool names
- Documented Auth0 OAuth 2.1 configuration and flow
- Added pagination metadata documentation
- Documented ActiveMQ connection patterns and messaging semantics
- Noted that `.env` files are not deployed from git (must be configured on server)
- **CLAUDE.md overhaul**: Streamlined per Anthropic best practices, added operational guidelines
- **MCP tool change guidance**: When adding/modifying MCP tools, must update `swf_list_available_tools()` hardcoded list in mcp.py

---

## v29 (2026-01-25)

### Per-User Configuration Override (SWF_TESTBED_CONFIG)

A new environment variable `SWF_TESTBED_CONFIG` enables per-user configuration overrides across all core repositories. This allows multiple users to run their own testbed instances with different configurations on the same system.

**Usage:**
```bash
export SWF_TESTBED_CONFIG=/path/to/my-testbed.toml
testbed run  # Uses your custom config instead of workflows/testbed.toml
```

This is supported in swf-testbed, swf-monitor (MCP tools), and swf-common-lib (BaseAgent).
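The lookup order implied above (environment override first, then the default file) can be sketched like this. The function name is hypothetical, and treating an empty variable as "not set" is an assumption.

```python
import os
from pathlib import Path

DEFAULT_CONFIG = Path("workflows/testbed.toml")


def resolve_testbed_config(environ=os.environ) -> Path:
    """Return the per-user override if SWF_TESTBED_CONFIG is set, else the default."""
    override = environ.get("SWF_TESTBED_CONFIG")
    return Path(override) if override else DEFAULT_CONFIG
```

Taking the environment as a parameter makes the resolution easy to check without modifying the real process environment.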

### Agent Manager Enhancements

The user agent manager daemon introduced in v28 has been significantly improved:

- **Config-driven namespace and agent selection**: The agent manager now reads namespace and agent configuration from testbed.toml, enabling different users to run different agent sets
- **REST logging**: Agent manager logs are now sent to swf-monitor for centralized viewing via `list_logs()`
- **Restart command**: New `restart` command for reloading configuration without full stop/start cycle
- **Immediate heartbeat**: Agent manager sends heartbeat immediately on startup, not after the first interval
- **Clean disconnect**: Proper cleanup on restart prevents stale connection state
- **Venv path handling**: Improved virtual environment path resolution

### New MCP Tool: get_testbed_status

A comprehensive status tool that combines agent manager, namespace, and workflow agent information in a single call.

```python
get_testbed_status(username='wenauseic')
```

Returns:
- Agent manager status (alive, namespace, control queue)
- Summary of running/stopped agents
- List of all workflow agents with current state

This replaces the need to call multiple tools to understand testbed readiness.

### MCP Improvements

- **SWF_TESTBED_CONFIG support**: MCP tools respect the per-user config override
- **start_user_testbed safety check**: Refuses to start if workflow agents are already running; the user must call `stop_user_testbed` first to ensure a clean state
- **Log filtering fixes**: Multiple fixes to username extraction in log list views; they now correctly filter by the username segment in agent instance names
- **Heartbeat API fix**: The heartbeat endpoint now properly updates operational_state, pid, and hostname fields
- **monitor_urls in responses**: MCP tool responses include links to relevant monitor UI pages

### Documentation

New architectural documentation with SVG diagrams:
- **docs/agent-management.md**: Agent lifecycle, supervisord integration, agent manager architecture
- **docs/fast-processing-workflow.md**: Fast processing pipeline, TF slice workflow, worker coordination
- **5 SVG diagrams**: Visual architecture diagrams for agent management and fast processing

Updated MCP documentation with Claude Code settings examples and query best practices.

### Signal Handlers (swf-common-lib)

BaseAgent now includes signal handlers for SIGTERM and SIGINT, enabling cleaner shutdown behavior when agents are terminated by supervisord or manually.

---

## v28 (2026-01-13)

### ActiveMQ Destination Prefix Requirement (Breaking Change)

**All ActiveMQ destinations now require explicit `/queue/` or `/topic/` prefix.** This is a breaking change that affects all agent code sending messages.

**Before (incorrect):**
```python
self.send_message('epictopic', message)  # WRONG - bare name
```

**After (correct):**
```python
self.send_message('/topic/epictopic', message)  # Correct - explicit prefix
```

**Why this matters:**
- Bare destination names were ambiguous - ActiveMQ behavior depends on broker configuration
- Explicit prefixes make the routing intention clear: `/queue/` for anycast (one consumer) vs `/topic/` for multicast (all consumers)
- BaseAgent now validates destination format and raises `ValueError` for bare names

Existing code using bare names will fail immediately with a clear error message explaining the required format. All example agents and workflow code have been updated.
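A validation step of this kind might look like the sketch below. The function name and error text are assumptions; only the prefix rule and the `ValueError` come from the notes above.

```python
VALID_PREFIXES = ("/queue/", "/topic/")


def validate_destination(destination: str) -> str:
    """Return the destination unchanged, or raise ValueError if it lacks a prefix."""
    # Reject bare names and prefixes with nothing after them.
    if not destination.startswith(VALID_PREFIXES) or destination in VALID_PREFIXES:
        raise ValueError(
            f"Invalid destination {destination!r}: use '/queue/<name>' or '/topic/<name>'"
        )
    return destination
```

Failing at send time with a message that names both accepted forms is what lets existing callers find and fix bare names quickly.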
0503
0504 ### MCP Workflow Control - AI-Driven Operations
0505
0506 The MCP service now provides **full workflow control**, enabling AI assistants to start, stop, and monitor workflows without requiring CLI access. This is the key enabler for AI-driven testbed operations.
0507
0508 **New workflow control tools:**
0509 - `start_workflow` - Start a workflow by sending a command to the DAQ Simulator agent. All parameters are optional; defaults are read from the user's `testbed.toml`. Override specific parameters (e.g., `stf_count=5`) while inheriting others from config.
0510 - `stop_workflow` - Stop a running workflow gracefully by execution_id. The workflow stops at the next checkpoint.
0511 - `end_execution` - Mark a stuck execution as terminated in the database. Use this to clean up stale executions that the agent can no longer reach.
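
The override behaviour of `start_workflow` can be pictured as a dictionary merge. The helper `resolve_workflow_params` and the `workflow_name` key are hypothetical, and the default values are illustrative rather than taken from any real `testbed.toml`.

```python
def resolve_workflow_params(defaults: dict, **overrides) -> dict:
    """Merge per-call overrides on top of defaults; None means 'not supplied'."""
    params = dict(defaults)
    params.update({k: v for k, v in overrides.items() if v is not None})
    return params


# Illustrative defaults only; real values would come from the user's testbed.toml.
defaults = {"workflow_name": "stf_datataking", "stf_count": 10}
params = resolve_workflow_params(defaults, stf_count=5)
```

Here `params` keeps the configured workflow name and replaces only `stf_count`, and the `defaults` dictionary itself is left untouched.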
0512
0513 **New agent management tools:**
0514 - `kill_agent` - Send SIGKILL to an agent process by instance name. Looks up the agent's PID and hostname, kills if on the same host, and always marks the agent as EXITED in the database.
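
The "kill if local, always mark EXITED" behaviour can be sketched as below. This is an assumption-laden outline, not the MCP tool's code: the agent is modelled as a plain dict, and `kill` and `local_host` are injectable only so the logic can be exercised without touching real processes.

```python
import os
import signal
import socket


def kill_agent_sketch(agent: dict, kill=os.kill, local_host=None) -> dict:
    """Kill the agent if it runs on this host; mark it EXITED either way."""
    local_host = local_host or socket.gethostname()
    result = {"killed": False, "reason": None}
    if agent.get("hostname") != local_host:
        result["reason"] = "agent is on a different host"
    elif not agent.get("pid"):
        result["reason"] = "no pid recorded"
    else:
        try:
            kill(agent["pid"], signal.SIGKILL)
            result["killed"] = True
        except OSError as exc:  # e.g. process already gone, or not permitted
            result["reason"] = str(exc)
    # The record is marked EXITED regardless of the outcome above.
    agent["status"] = "EXITED"
    return result
```

Marking the record unconditionally means a stale agent entry is cleaned up even when the process lives on another host or has already died.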
0515
0516 **New monitoring tools:**
0517 - `get_workflow_monitor` - Aggregated view of workflow execution: status, phase, STF count, key events, and errors (from both messages and logs). Single-call alternative to polling multiple tools.
0518 - `list_workflow_monitors` - List recent executions (last 24h) that can be monitored.
0519
0520 The MCP tool count has grown from 20 to **27 tools**. Documentation in `swf-monitor/docs/MCP.md` has been updated to reflect all tools.
0521
0522 ### User Agent Manager - Per-User Testbed Control via MCP
0523
0524 A new **agent manager daemon** enables MCP-driven control of per-user testbed agents. This allows AI assistants to start and stop a user's testbed without requiring SSH or terminal access.
0525
0526 **Architecture:**
0527 - Each user runs a lightweight `testbed agent-manager` daemon in their swf-testbed directory
0528 - The daemon listens on a user-specific queue (`/queue/agent_control.<username>`) for commands
0529 - It manages supervisord-controlled agents and reports status via heartbeats
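
As a rough picture of this architecture, the sketch below builds the per-user queue name and dispatches commands. The class `AgentManagerSketch`, the `command` message key, and the `start`/`stop` values are assumptions; the real daemon drives supervisord, which is stubbed out here.

```python
def control_queue(username: str) -> str:
    """Per-user control queue, following the naming described above."""
    return f"/queue/agent_control.{username}"


class AgentManagerSketch:
    """Hypothetical command dispatcher; real supervisord calls are stubbed out."""

    def __init__(self, username: str):
        self.queue = control_queue(username)
        self.agents_running = False

    def handle(self, message: dict) -> str:
        # The "command" key and its values are assumed for illustration.
        command = message.get("command")
        if command == "start":
            self.agents_running = True   # would start supervisord-managed agents
        elif command == "stop":
            self.agents_running = False  # would stop them
        else:
            return f"ignored unknown command: {command!r}"
        return f"{command} handled"
```

Because the queue name embeds the username, commands for one user's testbed cannot be consumed by another user's daemon.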
0530
0531 **New MCP tools:**
0532 - `check_agent_manager(username)` - Check if a user's agent manager is alive. Returns heartbeat status, control queue name, and whether agents are running.
0533 - `start_user_testbed(username, config_name)` - Send start command to agent manager. Agents start asynchronously.
0534 - `stop_user_testbed(username)` - Send stop command to agent manager.
0535
0536 **Usage:**
0537 ```bash
0538 # Start the agent manager daemon (run once, keeps running)
0539 cd /data/<username>/github/swf-testbed
0540 source .venv/bin/activate && source ~/.env
0541 testbed agent-manager
0542 ```
0543
0544 Then an AI assistant can:
0545 1. Check readiness: `check_agent_manager(username='wenauseic')`
0546 2. Start testbed: `start_user_testbed(username='wenauseic')`
0547 3. Run workflows: `start_workflow()`
0548 4. Stop when done: `stop_user_testbed(username='wenauseic')`
0549
0550 ### Persistent WorkflowRunner with Message-Driven Execution
0551
0552 The WorkflowRunner agent has been redesigned as a **persistent, message-driven service** rather than a one-shot script.
0553
0554 **Key changes:**
0555 - WorkflowRunner now starts with supervisord and listens on `/queue/workflow_control` for commands
0556 - Commands include `run_workflow` (from MCP `start_workflow`) and `stop_workflow`
0557 - Each execution gets a unique `execution_id` (e.g., `stf_datataking-wenauseic-0044`)
0558 - The `stop_workflow` command targets a specific execution by ID, enabling graceful termination
0559
0560 **Why this matters:**
0561 - The WorkflowRunner is always ready to receive workflow commands - it doesn't need to be started for each run
0562 - This models the actual ePIC system more realistically, where the DAQ system is a persistent service
0563 - Workflows can be started and stopped via MCP without CLI access
0564 - Multiple workflows can be managed by execution_id
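
The message-driven design can be outlined as below. Everything here is a sketch under stated assumptions: the message field names (`command`, `workflow`, `execution_id`), the way the counter is assigned, and the use of `threading.Event` as the stop flag are illustrative. Only the two command names and the shape of the `execution_id` come from the notes above.

```python
import threading


class WorkflowRunnerSketch:
    """Hypothetical dispatcher for messages arriving on /queue/workflow_control."""

    def __init__(self, username: str):
        self.username = username
        self.counter = 0
        self.stop_flags = {}  # execution_id -> threading.Event

    def on_message(self, message: dict):
        # Message field names are assumed for illustration.
        if message["command"] == "run_workflow":
            self.counter += 1
            execution_id = f"{message['workflow']}-{self.username}-{self.counter:04d}"
            self.stop_flags[execution_id] = threading.Event()
            return execution_id
        if message["command"] == "stop_workflow":
            flag = self.stop_flags.get(message["execution_id"])
            if flag is None:
                return None  # unknown execution: nothing to stop
            flag.set()  # the workflow checks this at its next checkpoint
            return message["execution_id"]
        raise ValueError(f"unknown command: {message['command']!r}")

    def should_stop(self, execution_id: str) -> bool:
        return self.stop_flags[execution_id].is_set()
```

Keying the stop flags by `execution_id` is what lets a stop request target one execution without disturbing any other that the same runner is tracking.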
0565
0566 ### Enhanced get_system_state - User Context and Readiness
0567
0568 The `get_system_state` MCP tool now accepts a `username` parameter and provides user-specific context.
0569
0570 **New fields returned:**
0571 - `user_context` - Namespace and workflow defaults from user's `testbed.toml`
0572 - `agent_manager` - Status of user's agent manager daemon (healthy/unhealthy/missing/exited)
0573 - `workflow_runner` - Status of DAQ Simulator in user's namespace
0574 - `ready_to_run` - Boolean indicating if the user can start a workflow
0575 - `last_execution` - Most recent workflow execution in user's namespace
0576 - `errors_last_hour` - Count of ERROR logs in user's namespace
0577
0578 This enables AI assistants to answer questions like "Am I ready to run a workflow?" with a single call.
0579
0580 ### EXITED Status and Agent Lifecycle
0581
0582 Improved agent lifecycle management with explicit EXITED status handling.
0583
0584 **Changes:**
0585 - Agents now set `status='EXITED'` and `operational_state='EXITED'` on clean shutdown
0586 - `list_agents` **excludes EXITED agents by default**; use `status='EXITED'` to see only exited agents, or `status='all'` to see all agents
0587 - `kill_agent` always marks agents as EXITED, even if the kill fails
0588 - EXITED agents don't clutter the active agent list but remain queryable for debugging
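
The filter semantics for `list_agents` can be expressed as a small pure function. This is a sketch over plain dicts rather than the tool's actual query; the agent names and the `OK` status value are made up for the example.

```python
def filter_agents(agents, status=None):
    """Apply the status-filter semantics described above to a list of dicts."""
    if status is None:
        # Default: hide EXITED agents.
        return [a for a in agents if a["status"] != "EXITED"]
    if status == "all":
        return list(agents)
    return [a for a in agents if a["status"] == status]


# Illustrative records; instance names and the "OK" status value are made up.
agents = [
    {"name": "daq-1", "status": "OK"},
    {"name": "proc-1", "status": "EXITED"},
]
```

With these records, the default call returns only `daq-1`, `status='EXITED'` returns only `proc-1`, and `status='all'` returns both.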
0589
0590 **Migration:** A database migration (`0014_systemagent_exited_status.py`) adds the EXITED choice to the status field.
0591
0592 ### Logging Context with execution_id
0593
0594 Improved log traceability with execution context in log records.
0595
0596 **Changes:**
0597 - New `_log_extra()` helper in BaseAgent returns consistent extra fields: `username`, `execution_id`, `run_id`
0598 - All agent log calls should use: `logger.info("message", extra=self._log_extra())`
0599 - `list_logs` MCP tool now supports `execution_id` parameter to filter logs by workflow execution
0600
0601 **Usage:**
0602 ```python
0603 # In agent code
0604 self.logger.info("Processing STF", extra=self._log_extra())
0605
0606 # Via MCP
0607 list_logs(execution_id='stf_datataking-wenauseic-0044')
0608 ```
0609
0610 This enables tracing all log messages for a specific workflow execution, essential for debugging workflow failures.
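
One way a helper like `_log_extra()` could be written is sketched below. The class `LoggingAgentSketch` and the capturing handler are hypothetical; only the three field names are taken from the notes above.

```python
import logging


class LoggingAgentSketch:
    """Hypothetical agent showing one way a `_log_extra()` helper could look."""

    def __init__(self, username, execution_id=None, run_id=None):
        self.username = username
        self.execution_id = execution_id
        self.run_id = run_id
        self.logger = logging.getLogger("sketch_agent")

    def _log_extra(self):
        # Same three keys on every call, even when a value is not yet known.
        return {
            "username": self.username,
            "execution_id": self.execution_id,
            "run_id": self.run_id,
        }


class CaptureHandler(logging.Handler):
    """Collects records so the attached fields can be inspected."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)
```

Python's `logging` module copies the keys of `extra` onto the `LogRecord` as attributes, so any handler can read `record.execution_id` directly and store or filter on it.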
0611
0612 ### Monitor UI Improvements
0613
0614 **Log detail page:** The log detail view (`/logs/<id>/`) now displays the `extra_data` JSON field when present. This shows execution context (execution_id, run_id, namespace, username) that agents include via `_log_extra()`. Previously this context was captured but not visible in the UI.
0615
0616 **Log list filtering:** The log list now supports filtering by execution_id, complementing the existing app_name, instance_name, and level filters.
0617
0618 ### Documentation Updates
0619
0620 - **MCP.md** completely rewritten to document all 27 tools with accurate parameters and return values
0621 - Removed "Not Yet Implemented" section - all documented tools are now functional
0622 - Added sections for Workflow Control, Agent Management, User Agent Manager, and Workflow Monitoring
0623 - Updated tool count from 20 to 27
0624
0625 ---
0626
0627 ## v27 (2026-01-08)
0628
0629 ### MCP Integration
0630
0631 The swf-monitor now exposes a **Model Context Protocol (MCP)** API, enabling AI assistants like Claude to query and interact with the testbed system.
0632
0633 **20+ MCP tools** for:
0634 - **System state**: `get_system_state`, `list_agents`, `get_agent`, `list_namespaces`
0635 - **Workflows**: `list_workflow_definitions`, `list_workflow_executions`, `get_workflow_execution`
0636 - **Data**: `list_runs`, `get_run`, `list_stf_files`, `get_stf_file`, `list_tf_slices`
0637 - **Messages & Logs**: `list_messages`, `list_logs`, `get_log_entry`
0638
0639 **Auto-discovery**: Add `.mcp.json` to your project root for Claude Code to automatically connect:
0640 ```json
0641 {
0642   "mcpServers": {
0643     "swf-testbed": {
0644       "type": "sse",
0645       "url": "https://pandaserver02.sdcc.bnl.gov/swf-monitor/mcp/"
0646     }
0647   }
0648 }
0649 ```
0650
0651 **Endpoint**: `https://pandaserver02.sdcc.bnl.gov/swf-monitor/mcp/`
0652
0653 ### Agent Lifecycle Management
0654
0655 Agents now report process information for lifecycle management:
0656 - **pid**: Process ID for kill operations
0657 - **hostname**: Host where agent is running
0658 - **operational_state**: STARTING → READY → PROCESSING → EXITED
0659
0660 These fields enable future orchestration features like agent health monitoring and remote termination.
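
A heartbeat carrying these fields might be assembled as in the sketch below. The function `build_heartbeat` and the `instance_name` key are assumptions; the three lifecycle fields and the state names follow the list above.

```python
import os
import socket
from enum import Enum


class OperationalState(Enum):
    STARTING = "STARTING"
    READY = "READY"
    PROCESSING = "PROCESSING"
    EXITED = "EXITED"


def build_heartbeat(instance_name: str, state: OperationalState) -> dict:
    """Assemble a heartbeat dict; the `instance_name` key is assumed."""
    return {
        "instance_name": instance_name,
        "pid": os.getpid(),
        "hostname": socket.gethostname(),
        "operational_state": state.value,
    }
```

Reporting `pid` together with `hostname` matters because a PID is only meaningful on the host that issued it.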
0661
0662 ### Database Logging
0663
0664 New `DbLogHandler` sends Python log records to the monitor database, enabling centralized log viewing:
0665 - View logs via monitor UI at `/logs/`
0666 - Filter by app, instance, level, time range
0667 - Query via MCP: `list_logs(level='ERROR')`, `get_log_entry(log_id)`
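
The general shape of a database-backed log handler is sketched below. This is not the `DbLogHandler` source: the class name `DbLogHandlerSketch`, its constructor arguments, and the payload keys are assumptions, and the database write is replaced by an injected `send` callable so the example stays self-contained.

```python
import logging


class DbLogHandlerSketch(logging.Handler):
    """Hypothetical handler: forwards each record to a `send` callable.

    In a real deployment `send` would write to the monitor database;
    here it is injected so the sketch stays self-contained.
    """

    def __init__(self, app_name, instance_name, send):
        super().__init__()
        self.app_name = app_name
        self.instance_name = instance_name
        self.send = send

    def emit(self, record):
        try:
            self.send({
                "app_name": self.app_name,
                "instance_name": self.instance_name,
                "level": record.levelname,
                "message": record.getMessage(),
                "timestamp": record.created,
            })
        except Exception:
            # Logging must never crash the agent; defer to logging's own error hook.
            self.handleError(record)
```

Attaching such a handler with `logger.addHandler(...)` leaves existing console or file handlers in place, so central logging is additive.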
0668
0669 ### BaseAgent Improvements
0670
0671 - Agents report EXITED status on shutdown
0672 - A warning is logged when a message is sent without a namespace set
0673 - Heartbeats include pid, hostname, operational_state
0674
0675 ---
0676
0677 ## v26 (2025-12-31)
0678
0679 ### Namespaces
0680
0681 Workflows now operate within **namespaces**, allowing users to isolate their work from others sharing the same infrastructure.
0682
0683 On shared systems like pandaserver02, multiple users can run workflows simultaneously. Namespaces let you filter the monitor UI to see only your workflows, agents, and messages, and avoid conflicts with other users.
0684
0685 Configure your namespace in `workflows/testbed.toml` before running any workflows:
0686
0687 ```toml
0688 [testbed]
0689 namespace = "your-namespace"  # e.g., "alice-dev", "team-fastmon"
0690 ```
0691
0692 All workflow messages now include the namespace, and the monitor UI provides namespace filtering on agents, executions, and messages.
0693
0694 ### Monitor UI
0695
0696 - **Namespace pages**: List and detail views; namespace column and filter on agents, executions, messages
0697 - **Agent list**: Type and status filters; click agent to see detail
0698 - **Agent detail**: Streamlined view linking to filtered workflow messages
0699 - **Workflow messages**: execution_id and run_id filters; STF count column; click for message detail
0700 - **Message detail**: Full message content view
0701 - **Drill-down links**: Click execution_id, run_id, namespace, or agent anywhere to navigate to details
0702 - **Source links**: GitHub links on workflow definition (branch) and execution (commit) pages
0703
0704 ### Workflow Refinements
0705
0706 **Count-based workflow completion:** Workflows can now run until a specific number of STF files are generated, rather than requiring a duration limit:
0707
0708 ```bash
0709 python workflows/workflow_simulator.py stf_datataking \
0710     --workflow-config fast_processing_default \
0711     --stf-count 10
0712 ```
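
The stopping rule behind `--stf-count` can be sketched as a loop that ends on whichever limit is reached first. The function `run_until` and its parameters are hypothetical and do not reflect the internals of `workflow_simulator.py`; the `clock` argument exists only to make the time limit testable.

```python
import time


def run_until(generate_stf, stf_count=None, duration=None, clock=time.monotonic):
    """Generate STF files until the count or the time limit is reached."""
    if stf_count is None and duration is None:
        raise ValueError("need stf_count, duration, or both")
    start = clock()
    produced = []
    while True:
        if stf_count is not None and len(produced) >= stf_count:
            break
        if duration is not None and clock() - start >= duration:
            break
        produced.append(generate_stf(len(produced)))
    return produced
```

A count-only call such as `run_until(make_file, stf_count=10)` always produces exactly ten items, which makes test runs reproducible in a way a duration limit is not.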
0713
0714 **Immutable definitions:** Workflow definitions are now immutable once created. The definition captures the source code and configuration at creation time. Each execution records its specific git version for reproducibility.
0715
0716 **Source traceability:** Workflow definitions now link to their source script on GitHub. Executions record the exact git commit, so you can always trace back to the code that ran.
0717
0718 ### Fast Processing Support
0719
0720 New infrastructure for fast processing workflows that sample STF data for near real-time monitoring:
0721
0722 - **Fast processing agent** (`example_agents/fast_processing_agent.py`) creates TF slices from STF samples
0723 - Configurable sampling rate, slices per sample, and processing time
0724 - Agents can start mid-run and extract context from messages
0725 - New monitor views: TF Slices (`/tf-slices/`) and Run States (`/run-states/`)
0726
0727 ### Agent Improvements
0728
0729 - Agents now register using the workflow name as their type (e.g., `STF_Datataking` instead of generic `workflow_runner`)
0730 - Retry logic for initial ActiveMQ connection improves reliability on startup
0731 - Agent list in monitor now supports type and status filters
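
Retry logic for an initial broker connection, as mentioned in the list above, is often written along the lines of the sketch below. The attempt count, the exponential backoff, and catching `ConnectionError` are assumptions; the exception types actually raised by the messaging client may differ.

```python
import time


def connect_with_retry(connect, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `connect` until it succeeds, doubling the delay after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
```

This helps most at startup, when supervisord may launch agents before the broker is accepting connections.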
0732
0733 ### Infrastructure
0734
0735 - Docker-compose updated with Redis and health checks
0736 - Artemis queue configuration guide added (`docs/artemis-queue-configuration.md`)
0737 - Fixed environment loading that was breaking git commands when `~/.env` contained PATH references
0738
0739 ---
0740
0741 *For detailed technical changes, see the pull requests for [swf-testbed](https://github.com/BNLNPPS/swf-testbed/pulls), [swf-common-lib](https://github.com/BNLNPPS/swf-common-lib/pulls), and [swf-monitor](https://github.com/BNLNPPS/swf-monitor/pulls).*