# Release Notes

## v34 (2026-04-21)

### Streaming MCP Moved Off mod_wsgi (swf-monitor)

The `/swf-monitor/mcp/` endpoint now runs on a dedicated ASGI worker (uvicorn, `swf-monitor-mcp-asgi.service` on `127.0.0.1:8001`) behind Apache `ProxyPass`. Everything else (`/about/`, `/api/`, `/accounts/login/`, PCS, static files) stays on mod_wsgi.

**Why:** `django-mcp-server` uses Starlette's `StreamableHTTPSessionManager`. Under WSGI, each streaming MCP session holds a thread via `async_to_sync` for the full session lifetime. A handful of concurrent MCP clients (OpenCode, Claude Code CLI, Ollama-backed scripts, python-httpx, or any other streamable-HTTP MCP client) was enough to saturate the thread pool and return 503 for every dynamic URL on the site. Isolating `/mcp/` on an async worker removes that failure mode from the main app.
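
The starvation described above can be reproduced in miniature. The sketch below is only an illustration, not swf-monitor code: a two-thread `ThreadPoolExecutor` stands in for the mod_wsgi thread pool, and the names `streaming_session`, `quick_request`, and `demo` are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError

POOL_SIZE = 2  # stand-in for the mod_wsgi thread count (illustrative value)


def streaming_session(release: threading.Event) -> str:
    """Simulate a long-lived MCP session that holds its thread until released."""
    release.wait()
    return "session closed"


def quick_request() -> str:
    """Simulate an ordinary short page or API request."""
    return "200 OK"


def demo() -> tuple[bool, str]:
    release = threading.Event()
    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        # Occupy every worker thread with a streaming session.
        for _ in range(POOL_SIZE):
            pool.submit(streaming_session, release)
        # The short request is queued behind them and cannot start.
        request = pool.submit(quick_request)
        try:
            request.result(timeout=0.2)
            starved = False
        except TimeoutError:
            starved = True
        # Ending the sessions frees the threads, and the request completes.
        release.set()
        return starved, request.result(timeout=5)


if __name__ == "__main__":
    print(demo())
```

Moving the long-lived sessions to a separate async worker means they no longer compete with short requests for the same fixed set of threads.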

**What changed operationally:**

- mod_wsgi tuned for burst resilience: `threads=30`, `listen-backlog=500`, `queue-timeout=30`, `inactivity-timeout=300`, `graceful-timeout=15` — no `request-timeout` (it would truncate the `/api/messages/stream/` SSE long-poll).
- Proxy tuned for streaming: `timeout=3600 keepalive=On disablereuse=On`, `proxy-sendchunked`, `no-gzip`, `CacheDisable` on `/mcp/`.
- `swf-monitor-mcp-asgi.service` systemd unit added (`Restart=always`, 2 uvicorn workers).
- `src/swf_monitor_project/asgi.py` cleaned up — removed dead `mcp_app.routing` import (the module was replaced by the `mcp_server` package long ago; the ASGI entrypoint was quietly broken).

### Apache Config Auto-Sync on Deploy (swf-monitor)

`apache-swf-monitor.conf` in the repo is now the source of truth. `deploy-swf-monitor.sh` diffs it against the live `/etc/httpd/conf.d/swf-monitor.conf` on every deploy; if they differ, it backs up the live file, installs the copy from the release, validates with `httpd -t`, and rolls back on failure. The Apache reload that happens every deploy (to recycle mod_wsgi for new Python code) picks up any conf change along with it.
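
The diff, backup, install, validate, and rollback sequence can be sketched as follows. This is a Python illustration of the control flow only; the real logic lives in the `deploy-swf-monitor.sh` shell script, and the function name `sync_conf`, the `.bak` suffix, and the injected `validate` callable (which would wrap `httpd -t` in a real deploy) are assumptions.

```python
import filecmp
import shutil
from pathlib import Path
from typing import Callable


def sync_conf(repo_conf: Path, live_conf: Path, validate: Callable[[], bool]) -> str:
    """Install repo_conf over live_conf only if they differ, then validate.

    Returns "unchanged", "installed", or "rolled_back".
    """
    had_live = live_conf.exists()
    if had_live and filecmp.cmp(repo_conf, live_conf, shallow=False):
        return "unchanged"

    # Hypothetical backup location: same directory, ".bak" suffix.
    backup = live_conf.with_name(live_conf.name + ".bak")
    if had_live:
        shutil.copy2(live_conf, backup)  # back up the live file first

    shutil.copy2(repo_conf, live_conf)  # install the release copy
    if validate():  # in a real deploy this would wrap `httpd -t`
        return "installed"

    # Validation failed: restore the previous state.
    if had_live:
        shutil.copy2(backup, live_conf)
    else:
        live_conf.unlink()
    return "rolled_back"
```

Passing the validator in as a callable keeps the sketch testable without Apache installed.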

**Why it matters:** there was a 6-week drift — the Mar 11 `dce7abf` fix for MCP IP restriction was committed to the repo but never reached live Apache because nothing copied it. `setup-apache-deployment.sh` regenerated the conf from a hardcoded heredoc (which had itself drifted from the canonical repo copy), and `deploy-swf-monitor.sh` didn't touch the Apache conf at all. That gap is now closed: the setup script `cp`s `apache-swf-monitor.conf` and splits the dynamic `LoadModule` line out to `/etc/httpd/conf.modules.d/20-swf-monitor-wsgi.conf`.

**ASGI worker is also recycled on every deploy** — uvicorn loads code once at startup, so fresh Python code requires a restart. Bots already follow the same pattern (conditional on a bot-specific code change).

### PanDA Mattermost Bot — Multi-Server MCP with Progressive Tool Loading (swf-monitor)

The PanDA bot now orchestrates across **seven external MCP servers** plus the local swf-monitor MCP, selecting tools based on the user's question. New integrations:

- **LXR MCP server** (`github.com/BNLNPPS/lxr-mcp-server`, new this release) — EIC code browser cross-reference. `lxr_ident` (definitions + references), `lxr_search` (ripgrep across repos), `lxr_source` (read source with line numbers), `lxr_list` (browse directories).
- **uproot MCP server** (`github.com/eic/uproot-mcp-server`) — inspect ROOT files: list branches, read arrays, sample contents.
- **JLab-Rucio and BNL-Rucio MCP servers** — query Rucio for EIC datasets, replicas, and rules.
- **GitHub MCP server** — now uses the `epic-capybara` service account with write access for bot-driven automation on EIC repos.
- **epicdoc** — RAG search over ePIC documentation (`epic_doc_search`, `epic_doc_contents`). Runs in-process inside the bot, neither as a separate MCP server nor inside WSGI. An initial attempt to host it in WSGI brought the monitor down, so it was moved; see the debugging notes in the 2026-03-31 assessment.

With that many tools, "send the whole catalog to the LLM every turn" stops working. Two new techniques address that:

- **Progressive tool loading via semantic similarity.** For each user question the bot embeds the question and ranks tools by server-prefixed cosine similarity, auto-truncating at a score cliff. The LLM sees a small, relevant slice rather than the full catalog of hundreds of tools — and the rank is preserved through the display so the LLM can judge relevance.
- **3-tier tool awareness.** Every tool is visible by name + one-line catalog entry in the system prompt, so the LLM knows the full surface area exists at minimal token cost. Detailed schemas are fetched only for tools the LLM explicitly selects via `select_tools`. Server and suggestion context carries forward across thread turns, so follow-ups don't re-select from scratch.
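
A minimal sketch of similarity ranking with score-cliff truncation follows. It is not the bot's implementation: the function names, the `min_gap` threshold, and the two-dimensional toy vectors (standing in for real embeddings of server-prefixed tool descriptions) are all assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_and_truncate(
    question_vec: list[float],
    tool_vecs: dict[str, list[float]],
    min_gap: float = 0.2,  # hypothetical threshold for what counts as a "cliff"
) -> list[tuple[str, float]]:
    """Rank tools by similarity and cut at the largest drop between neighbours.

    If no drop reaches min_gap, the full ranking is returned.
    """
    ranked = sorted(
        ((name, cosine(question_vec, vec)) for name, vec in tool_vecs.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    if gaps and max(gaps) >= min_gap:
        cut = gaps.index(max(gaps)) + 1  # keep everything above the cliff
        ranked = ranked[:cut]
    return ranked


if __name__ == "__main__":
    # Toy vectors; real embeddings would come from an embedding model.
    tools = {
        "lxr_search": [1.0, 0.0],
        "lxr_ident": [0.9, 0.1],
        "panda_list_jobs": [0.0, 1.0],
    }
    for name, score in rank_and_truncate([1.0, 0.0], tools):
        print(f"{name}: {score:.3f}")
```

Because the result stays in rank order with scores attached, the ranking can be shown to the LLM along with the selected tools.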

**Other bot improvements:**

- **System prompt externalized** to `monitor_app/panda/system_prompt.txt` and re-read on every message — prompt iteration no longer requires a bot restart.
- **DPID detection hardened.** For job/task questions the bot verifies that any Data Provenance ID in the reply came from actual tool output before letting it through. Detection is now line-based and format-agnostic; trigger word **AND** a matching ID must both be present.
- **Bamboo log analysis** integrated into `panda_study_job` for failed jobs — surfaces Harvester pilot-log analysis automatically when filebrowser lookup fails. Exposed to the LLM via an explicit `log_analysis` field the bot is instructed to surface.
- **Response style rules** in the system prompt curb overenthusiastic replies (e.g., verbose explanations when a one-line answer suffices).
- Server-side matplotlib plot rendering; nightly cron scripts auto-update each MCP server repo.

### New swf-monitor MCP Tool: `panda_harvester_workers`

Live Harvester pilot/worker counts via bamboo's `askpanda_atlas`. Useful for "what pilots are running right now?" without needing to grep through Harvester logs.

```python
panda_harvester_workers(status='running', site='NERSC', resourcetype='SCORE', days=1)
```

Returns totals plus a breakdown by status, site, and resourcetype, in a clean, LLM-friendly response format.

### PCS — Compose UX Polish + Programmatic Submission Path (swf-monitor)

**Compose pages (Physics/EvGen/Simu/Reco tags, Datasets, Prod Configs, Prod Tasks):**

- Uniform button styling — all filled (solid) variants, dark-green accent on live edited values, consistent New-button placement in the left panel across all compose views.
- Breadcrumbs and Cancel buttons point to compose views instead of the legacy list views.
- Name-based URL params so compose views are bookmarkable and deep-linkable.
- Owner-only edit enforcement on production configs (same discipline as tag edits).
- Edit / Copy / New buttons no longer silently fail on prod config compose (previous type-argument mismatch fixed).
- Compose panels for `command` and `taskParamMap` grow to fit content instead of forcing horizontal scroll.
- Fixed type-argument mismatch in compose URL sync.

**Production Tasks — submission artifacts:**

A single read-only endpoint regenerates a task's submission artifact from current PCS state on every call (no DB writes):

```
GET /swf-monitor/pcs/api/prod-tasks/command/?name=<task_name>&fmt=<format>
```

| `fmt` | Contents |
|-------|----------|
| `condor` | env-prefixed `submit_csv.sh` command |
| `panda` | `prun` command |
| `jedi` | `taskParamMap` for `Client.insertTaskParams()` |
| `dump` | Full view: task + dataset + all four tags + prod config + effective config |

The parameter is `fmt` because DRF reserves `format` for its own content negotiation.

**New CLI `pcs-task-cmd`** — stdlib-only Python client over that endpoint. It is the recommended way for production operators and automation to fetch submission artifacts (no Django import, no DB credentials):

```bash
# Inspect a task
pcs-task-cmd <task_name> --format dump

# Submit to JEDI (requires valid PanDA auth)
pcs-task-cmd <task_name> --format jedi | python -c '
import json, sys
from pandaclient import Client
print(Client.insertTaskParams(json.load(sys.stdin)))
'

# Run the Condor command in the current shell
eval "$(pcs-task-cmd <task_name> --format condor)"
```

Environment: `SWFMON_URL` (default `https://epic-devcloud.org/prod`), optional `SWFMON_TOKEN` for non-public deployments.
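
For callers that prefer Python to the CLI, the same request could be assembled with the standard library alone. The sketch below is not the `pcs-task-cmd` source. Two details are assumptions that should be checked against the deployment: how the endpoint path joins onto `SWFMON_URL` (`COMMAND_PATH`), and the `Authorization: Token ...` header format.

```python
import os
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the endpoint path is appended directly to the base URL.
COMMAND_PATH = "/pcs/api/prod-tasks/command/"
FORMATS = {"condor", "panda", "jedi", "dump"}


def build_request(task_name: str, fmt: str, env=os.environ) -> Request:
    """Build (but do not send) a GET request for a task's submission artifact."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown fmt: {fmt!r}")
    base = env.get("SWFMON_URL", "https://epic-devcloud.org/prod").rstrip("/")
    query = urlencode({"name": task_name, "fmt": fmt})
    request = Request(f"{base}{COMMAND_PATH}?{query}")
    token = env.get("SWFMON_TOKEN")
    if token:
        # Assumption: the token travels in a DRF-style "Token" header.
        request.add_header("Authorization", f"Token {token}")
    return request
```

The returned object can be passed to `urllib.request.urlopen()` to perform the call; keeping construction separate from sending makes the URL easy to inspect first.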

**JEDI taskParamMap now surfaced on task detail** — `build_task_params()` renders the full param map users will submit, viewable and copyable directly from the compose page.

### Deploy-Script Improvements (swf-monitor)

- **`swf-monitor-mcp-asgi.service` restart step** — always restarts on deploy (uvicorn needs it).
- **Apache conf sync** — described above.
- **Shared HuggingFace cache** — `deploy-swf-monitor.sh` ensures `/opt/swf-monitor/shared/hf_cache` exists with open perms and appends `HF_HOME=` to `production.env` if missing. Bamboo and epicdoc reuse the cache across processes.
- **Bot restarts after health check, not before** — avoids killing bots mid-request if Apache comes up broken.
- **Nightly cron** (`nightly-update-mcp-servers.sh`, `nightly-update-epicdoc.sh`) — auto-updates sibling MCP-server repos and re-ingests ePIC documentation into epicdoc's ChromaDB store.

### PanDA Production Monitoring — Job Deep-Dive Enhancements (swf-monitor)

- **NERSC portal log URLs** surfaced for Perlmutter jobs in `panda_study_job` — clickable links to the NERSC job portal alongside existing Harvester log URLs.
- **Bamboo log analysis** runs on failed jobs automatically; LLM-friendly `log_analysis` field with fallback to Harvester URL when filebrowser fails.
- **Error field rename** in `/panda job` output (source → component) — fixes a KeyError that surfaced on some job records.

### Auth & API Changes (swf-monitor)

- **`TunnelAuthMiddleware`** now requires an `X-Remote-User` header before auto-authenticating — anonymous proxy requests no longer get a free pass. This matches the threat model of the `TunnelAuthentication` DRF backend, which also checks the header before acting.
- **`/api/users/`** response now includes `email`, `first_name`, `last_name` — enables richer devcloud account sync.
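
The header requirement amounts to a small guard. The sketch below is framework-free and hypothetical, not the actual middleware: the function name `tunnel_user` and the loopback-address check are assumptions based on the description of tunnel requests as local.

```python
from typing import Mapping, Optional

# Assumption: tunnel/proxy traffic arrives from the loopback interface.
LOOPBACK = {"127.0.0.1", "::1"}


def tunnel_user(remote_addr: str, headers: Mapping[str, str]) -> Optional[str]:
    """Return the proxied username, or None if the request must stay anonymous."""
    if remote_addr not in LOOPBACK:
        return None
    username = headers.get("X-Remote-User", "").strip()
    # A missing or empty header means no auto-authentication.
    return username or None
```

Inside Django the same header is exposed as `request.META["HTTP_X_REMOTE_USER"]`, so a real middleware would read it from there rather than from a plain mapping.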

### Documentation

- **`PRODUCTION_DEPLOYMENT.md`** refreshed for the two-backend layout, new `setup-apache-deployment.sh` behavior, and the full deploy step list (conf sync, ASGI worker restart).
- **`MCP.md`** — ASGI/WSGI split documented, transport description corrected (it IS streamable HTTP), tool summary count corrected to 44, all tool categories added.
- **`PCS.md`** — MCP Tools table corrected to the tools that actually exist.
- **JEDI design docs** added: `JEDI_INTEGRATION.md` (architecture, field mapping, implementation plan) and `JEDI_EPIC_PROPOSAL.md` (technical proposal for PanDA team review) — roadmap for direct task submission to JEDI replacing the current `prun` CLI text generation.

### Agent Resilience (swf-common-lib)

Further hardening of the BaseAgent lifecycle under unreliable infrastructure:

- **Agent-ID registration retries indefinitely** on API failure (previously gave up after a bounded number of attempts). Agents starting into a partially-up monitor no longer silently fail to register.
- **Improved resilience to server restarts** — agents survive transient monitor outages and resume their heartbeat loop cleanly on reconnection.
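
An unbounded retry loop of the kind described might look like the sketch below. It is not the BaseAgent code: `retry_forever` and its parameters are hypothetical, and pairing the unbounded retry with the capped exponential backoff from the v33 notes is an assumption.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_forever(
    call: Callable[[], T],
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call` until it succeeds, doubling the wait up to max_delay."""
    delay = base_delay
    while True:
        try:
            return call()
        except retry_on:
            # Connection-level failures (OSError and subclasses) are retried;
            # anything else propagates so real bugs are not masked.
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

Capping the delay keeps an agent that started during a long outage from waiting arbitrarily long once the monitor returns, and the injectable `sleep` lets the loop be tested without real waits.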

### swf-testbed — Upstream Contributions Integrated

Several contributions landed direct-to-main during and just before the v34 cycle that were not acknowledged in earlier release notes. They are part of main as of this release. With thanks:

**Agent code consolidation — Dmitry Kalinkin (PR #35, #36)**

Unified agent code into the `swf-testbed` repository:

- **PR #35 "Import SOTA agents"** — imports `agents/data_agent.py` and `agents/processing_agent.py` with full git history from the sibling repositories `BNLNPPS/swf-data-agent` and `BNLNPPS/swf-processing-agent`. Supersedes the shell of earlier example agents with BaseAgent-derived implementations (Rucio / XRootD integration, MQ handlers, dataset lifecycle).
- **PR #36 "Delete superseded agents"** — final cleanup once the unified `agents/` package stabilized: removes `example_agents/daq_simulator_superseded.py`, `example_agents/example_daqsim_agent_superseded.py`, and `example_agents/processing_agent.py`.

**Prompt-processing workflow — Zhaoyu Yang (PR #37, #38)**

A new streaming workflow for prompt processing of time-frame slices, built on top of Dmitry's imported agents package:

- `agents/prompt_processing_agent.py` — new agent for the prompt-processing pipeline
- `workflows/prompt_processing.py`, `workflows/prompt_processing.toml`, `workflows/prompt_processing_default.toml` — workflow definition and default config
- Orchestrator wiring in `workflows/orchestrator.py`; supervisord entry in `agents.supervisord.conf`
- `scripts/dummy_stf_processing.sh` — placeholder payload for development
- Refactor updates to `agents/data_agent.py` supporting the new flow
- Documentation: `docs/prompt-processing-workflow.md`, architecture image `docs/images/prompt-processing-workflow.png`, `docs/skills-for-testbed.md`

**CRIC endpoint / queue-config expansions — Xin Zhao (PR #34)**

- `config/ddm_endpoints.json` — substantial DDM endpoint additions (+465 lines)
- `config/panda_queues.json` — PanDA queue config additions (+1030 lines)
- Reflects updated CRIC-sourced site/endpoint data for ePIC production

### swf-testbed — Baseline Branch Work

No user-facing changes on the `infra/baseline-v34` branch itself — administrative commits only (CLAUDE.md branch-reference updates, v33 release notes catch-up, v34 release notes including this acknowledgments section).

---

## v33 (2026-03-29)

### Dual-Mode UI: ePIC Production / ePIC Testbed (swf-monitor)

The monitor now operates in two modes, selectable via a nav bar toggle (localStorage-persisted):

- **ePIC Production** (`/prod/`) — PanDA production monitoring (activity, jobs, tasks, errors, diagnostics, queues) + PCS (tags, datasets, prod configs, prod tasks). Shared PCS sections template keeps PCS hub and production hub in sync.
- **ePIC Testbed** (`/testbed/`) — Streaming workflow testbed: workflows, time frame data, agents, messaging, system state, PanDA/Rucio.

Root URL redirects based on mode. About page updated for dual-mode, all access methods, tech stack.

### PanDA Production Pages (swf-monitor)

Full DataTables views for **Activity, Jobs, Tasks, Errors, Diagnostics**. **EIC PanDA Queues** from live schedconfig with MCP tools (`panda_list_queues`, `panda_get_queue`). **`panda_resource_usage`** for allocated vs used core-hours. **`panda_study_job`** for deep single-job analysis. **`destinationse`** (destination storage element) from filestable4 added to job listings and error summary. PanDA query modules refactored into `constants.py`, `sql.py`, `queries.py`. Monitor links point to epic-devcloud.org.

### PCS Auth & Proxy Support (swf-monitor)

Full PCS functionality through the swf-remote (epic-devcloud.org) proxy:

- **`TunnelAuthentication`** DRF backend — authenticates localhost/tunnel requests via `X-Remote-User` header without CSRF enforcement
- **`IsAuthenticatedOrReadOnly`** on all PCS API viewsets — anonymous GET, auth required for writes
- **`created_by` from `request.user`** — read-only in serializers, set server-side
- **Tag delete API** — `POST /delete/` with creator-only, draft-only enforcement
- **All PCS templates** converted from form POST to JS fetch → REST API
- **`/api/users/`** endpoint with password hash for devcloud account sync

### Mattermost PanDA Bot (swf-monitor)

- **4 MCP server types**: HTTP (PanDA, PCS), stdio (XRootD, GitHub, Zenodo)
- **DPID (Data Provenance ID)** anti-fabrication: bot verifies LLM cited a real DPID, strips from user reply, warns if verification fails
- **`/panda` slash commands** — status, errors, jobs/tasks with status filter and pagination, job/task detail, sites, site detail, help
- **`bot_manage_servers`** virtual tool — list with versions, update/rebuild/restart
- **Server-side matplotlib plots** in Mattermost
- System prompt: data integrity rules, security rules, "never ask user to look something up"

### MCP Servers

- **Zenodo** (`eic/zenodo-mcp-server`) — search, inspect, download from zenodo.org
- **XRootD** (`eic/xrootd-mcp-server`) — file browsing and reading on JLab XRootD
- **GitHub** (`github/github-mcp-server`) — read-only repo, issue, PR, actions access
- **StdioMCPClient** transport for managing external MCP server subprocesses

### Agent Resilience (swf-common-lib, swf-testbed)

- API retry with exponential backoff (swf-common-lib)
- Agent manager: supervisord health verification, SIGUSR1 heartbeat, exit heartbeat on shutdown
- check-testbed skill and supervisord health monitoring
- AI memory hooks for cross-session dialogue persistence

### Bug Fixes

- Namespace datatable: `Count('id')` on model without `id` field
- `list_tasks`: stale filter params misaligned with where clauses
- Django 5+ logout requires POST
- Workflow parameter override: auto-discover all config sections

## v32 (2026-03-02)

### PCS (Physics Configuration System) — New Django App (swf-monitor)

A new Django app for configuring production tasks based on physics inputs for ePIC Monte Carlo simulation campaigns. PCS organizes configurations as tags — named parameter sets for each stage of the MC pipeline:

- **Physics tags (p):** process, beam energies, species, Q2 range
- **EvGen tags (e):** event generator and version
- **Simu tags (s):** detector simulation config
- **Reco tags (r):** reconstruction config

Tags have a draft/locked lifecycle. Locked tags are immutable and used in production.
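
The draft/locked lifecycle can be pictured with a small data class. This is a hypothetical sketch, not the PCS Django model: the class and field names, and the use of `PermissionError`, are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Tag:
    """Toy model of a PCS tag with a one-way draft -> locked transition."""

    name: str
    parameters: dict = field(default_factory=dict)
    status: str = "draft"

    def set_parameter(self, key: str, value) -> None:
        # Locked tags are immutable, so edits are refused.
        if self.status == "locked":
            raise PermissionError(f"tag {self.name!r} is locked")
        self.parameters[key] = value

    def lock(self) -> None:
        # There is deliberately no unlock: locked tags feed production.
        self.status = "locked"
```

Making the transition one-way is what lets a production task reference a tag by name and trust that its parameters never change afterwards.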

**Tag compose UI:** Split-panel interface for browsing, creating, editing, copying, and locking tags. Arrow key navigation, parameter filter dropdowns, inline editing with suggestion bars, predicted tag numbering, and diff highlighting for edits. Generalized for all four tag types with category-conditional fields.

**Seeded data:** `seed_campaign_tags` management command creates 64 tags from the 26.02.0 campaign (47 physics, 15 evgen, 1 simu, 1 reco).

**MCP tools:** `pcs_list_tags`, `pcs_get_tag`, `pcs_search_tags`.

### PanDA Mattermost Bot (swf-monitor)

Claude-based production monitoring chatbot in Mattermost. Listens in the `#pandabot` channel, answers questions using Claude Haiku with tool use.

- Discovers tools from MCP server automatically
- System prompt built from MCP server instructions, stays in sync with deployed tool documentation
- Supports PanDA and PCS tools
- Thread-aware conversations

### PanDA Web Monitor (swf-monitor)

New web views for ePIC-focused PanDA production monitoring:

- Activity overview, job list, task list, job detail, task detail, error summary, job diagnostics
- Cross-linking, days selector, server-side DataTables, colored status badges
- Shares data layer with MCP tools via factored `panda/` package (`constants.py`, `sql.py`, `queries.py`)

### PanDA MCP Tools — New and Enhanced (swf-monitor)

Six new tools for PanDA production monitoring via MCP:

- `panda_list_jobs` — job overview with summary stats, cursor-based pagination
- `panda_list_tasks` — JEDI task monitoring with workinggroup/processingtype filters
- `panda_get_activity` — pre-digested activity overview (aggregate counts, no individual records)
- `panda_error_summary` — aggregate error ranking across failed jobs
- `panda_diagnose_jobs` — failed job diagnostics with all 7 error component fields
- `panda_study_job` — deep single-job analysis (~40 fields, filestable, condor logs, structured errors)

### MCP Infrastructure (swf-monitor)

- Refactored monolithic `mcp.py` (2,544 lines) into `mcp/` package
- AI memory model and REST API for cross-session dialogue persistence
- Fixed `_get_username()`: use SWF_HOME directory ownership instead of `getpass.getuser()` (returns 'apache' under WSGI)
- Fixed fastmon-files API to accept STF filename string instead of requiring UUID
- Added Bootstrap 5 CSS

### Documentation Cleanup

Deleted 9 stale or superseded files across both repos (1,800+ lines removed): old monolithic README backup, abandoned design docs, failed procedure docs, one-time reports, broken index pages. Fixed hardcoded credentials in installation guide, dead links, malformed markdown, and updated CLAUDE.md branch reference to v32.

### swf-common-lib

No changes in v32.

---

## v31 (2026-02-18)

### Robustness Improvements for LLM-driven Testbed Controls

Hardened the MCP control path so AI agents can reliably start, monitor, and manage testbed workflows without misinterpreting system state.

**Testbed status fixes (swf-monitor):**
- `ready` field now checks running workflow executions, not agent count — was permanently false when agents were idle after a completed workflow
- REST heartbeat no longer overwrites `workflow_enabled` to false on every heartbeat
- `start_workflow` namespace resolution falls back to running agent manager's namespace when env var unavailable in Apache context
- `start_user_testbed` no longer destroys the agent manager on every start
- Surfaced supervisord health and agent manager errors in MCP status tools
- Fixed MCP username resolution: use SWF_HOME directory ownership, require explicit username parameter

**Agent manager hardening (swf-testbed):**
- Verify supervisord health, check agent starts, log errors instead of failing silently
- SIGUSR1 heartbeat refresh after check-testbed fixes
- Exit heartbeat on shutdown so DB immediately reflects agent manager death
- check-testbed skill for bootstrapping infrastructure
- Fixed workflow parameter override to auto-discover all config sections

**Workflow monitoring guidance:**
- MCP docs now instruct AI to actively poll `swf_get_workflow_monitor` during execution rather than sleeping

**Other:**
- AI memory hooks and documentation for cross-session dialogue persistence
- Refactored monolithic mcp.py into package (system, workflows, ai_memory, common)

## v30 (2026-02-03)

### Auth0 OAuth 2.1 Authentication for Claude.ai MCP

Added secure OAuth 2.1 authentication for remote MCP connections from Claude.ai, using [Auth0](https://auth0.com/) as the identity provider.

**How it works:**
1. Claude.ai discovers OAuth metadata via `/.well-known/oauth-protected-resource`
2. User authenticates with Auth0 (redirected to Auth0's login page)
3. Auth0 issues JWT access token to Claude.ai
4. Claude.ai includes Bearer token in MCP requests
5. Django middleware validates JWT against Auth0's JWKS endpoint

**Configuration:**
```bash
AUTH0_DOMAIN=your-tenant.us.auth0.com
AUTH0_CLIENT_ID=your-client-id
AUTH0_CLIENT_SECRET=your-client-secret
AUTH0_API_IDENTIFIER=https://your-server/swf-monitor/mcp
```

**Access modes:**
- **Claude.ai (remote)**: Requires OAuth authentication via Auth0
- **Claude Code (local)**: POST requests pass through without auth for local development

**Network requirement:** Claude.ai connects from Anthropic's servers, so the MCP endpoint must be accessible from the public internet.

### MCP Tool Naming Convention

Renamed all 29 MCP tools with `swf_` service prefix for multi-server discovery:
- `list_agents` → `swf_list_agents`
- `get_system_state` → `swf_get_system_state`
- etc.

This follows MCP best practices for environments where multiple MCP servers are connected. The prefix enables clean tool discovery and avoids naming collisions.

Reference: https://www.philschmid.de/mcp-best-practices

### Pagination Metadata for List Tools

All list tools now return pagination metadata to help LLMs manage context:

```json
{
  "items": [...],
  "total_count": 1523,
  "has_more": true,
  "monitor_urls": [...]
}
```

- `total_count`: Total matching records in database
- `has_more`: Boolean indicating results are truncated

This helps LLMs understand when query results are incomplete and whether to refine filters.
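
A response envelope of this shape can be built in a few lines. The sketch below is illustrative only: the `paginate` helper and its `limit` argument are hypothetical, and a real tool would take `total_count` from a database count rather than an in-memory list.

```python
from typing import Optional


def paginate(records: list, limit: int, monitor_urls: Optional[list] = None) -> dict:
    """Wrap the first `limit` records with the pagination metadata fields."""
    items = records[:limit]
    return {
        "items": items,
        "total_count": len(records),  # all matching records, not just this page
        "has_more": len(records) > len(items),  # True when results were truncated
        "monitor_urls": list(monitor_urls or []),
    }
```

A client that sees `has_more` set to true knows the answer is partial and can narrow its filters instead of treating `items` as the complete result.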

### New MCP Tool: swf_send_message

Send messages to the workflow monitoring stream:

```python
swf_send_message(
    message="Test message",
    message_type="announcement",  # or "test", custom types
    metadata={"key": "value"}     # optional
)
```

Use cases:
- Testing the message pipeline end-to-end
- Sending announcements to colleagues monitoring the stream
- Debugging SSE relay functionality

### Message Type Standardization

Standardized on `stf_ready` message type across all agents. Previously some agents used `data_ready` inconsistently. Updated `WORKFLOW_MESSAGE_TYPES` in swf-common-lib and all example agents. Also added `tf_file_registered` to the canonical message types.

### Bug Fixes

- **Fixed monitor URLs in MCP responses**: Tool responses were returning localhost URLs instead of production URLs. Now correctly returns URLs based on deployment configuration.

### Documentation

- Updated `docs/MCP.md` with all swf_ prefixed tool names
- Documented Auth0 OAuth 2.1 configuration and flow
- Added pagination metadata documentation
- Documented ActiveMQ connection patterns and messaging semantics
- Noted that `.env` files are not deployed from git (must be configured on server)
- **CLAUDE.md overhaul**: Streamlined per Anthropic best practices, added operational guidelines
- **MCP tool change guidance**: When adding/modifying MCP tools, must update `swf_list_available_tools()` hardcoded list in mcp.py

---

## v29 (2026-01-25)

### Per-User Configuration Override (SWF_TESTBED_CONFIG)

A new environment variable `SWF_TESTBED_CONFIG` enables per-user configuration overrides across all core repositories. This allows multiple users to run their own testbed instances with different configurations on the same system.

**Usage:**
```bash
export SWF_TESTBED_CONFIG=/path/to/my-testbed.toml
testbed run  # Uses your custom config instead of workflows/testbed.toml
```

This is supported in swf-testbed, swf-monitor (MCP tools), and swf-common-lib (BaseAgent).
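
The override rule reduces to one lookup with a fallback. The sketch below is an assumption about how a consumer might resolve the path, not the code in any of the three repositories; `resolve_config_path` is a hypothetical name.

```python
import os
from pathlib import Path

# Default named in the usage example above.
DEFAULT_CONFIG = Path("workflows/testbed.toml")


def resolve_config_path(env=os.environ) -> Path:
    """Return the per-user config if SWF_TESTBED_CONFIG is set, else the default."""
    override = env.get("SWF_TESTBED_CONFIG")
    return Path(override) if override else DEFAULT_CONFIG
```

Treating an empty value the same as an unset variable avoids resolving to the current directory when the variable is exported but blank.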

### Agent Manager Enhancements

The user agent manager daemon introduced in v28 has been significantly improved:

- **Config-driven namespace and agent selection**: The agent manager now reads namespace and agent configuration from testbed.toml, enabling different users to run different agent sets
- **REST logging**: Agent manager logs are now sent to swf-monitor for centralized viewing via `list_logs()`
- **Restart command**: New `restart` command for reloading configuration without full stop/start cycle
- **Immediate heartbeat**: Agent manager sends heartbeat immediately on startup, not after the first interval
- **Clean disconnect**: Proper cleanup on restart prevents stale connection state
- **Venv path handling**: Improved virtual environment path resolution

### New MCP Tool: get_testbed_status

A comprehensive status tool that combines agent manager, namespace, and workflow agent information in a single call.

```python
get_testbed_status(username='wenauseic')
```

Returns:
- Agent manager status (alive, namespace, control queue)
- Summary of running/stopped agents
- List of all workflow agents with current state

This replaces the need to call multiple tools to understand testbed readiness.

### MCP Improvements

- **SWF_TESTBED_CONFIG support**: MCP tools respect the per-user config override
- **start_user_testbed safety check**: Refuses to start if workflow agents are already running - user must call stop_user_testbed first to ensure clean state
- **Log filtering fixes**: Multiple fixes to username extraction in log list views - now correctly filters by the username segment in agent instance names
- **Heartbeat API fix**: The heartbeat endpoint now properly updates operational_state, pid, and hostname fields
- **monitor_urls in responses**: MCP tool responses include links to relevant monitor UI pages

### Documentation

New architectural documentation with SVG diagrams:
- **docs/agent-management.md**: Agent lifecycle, supervisord integration, agent manager architecture
- **docs/fast-processing-workflow.md**: Fast processing pipeline, TF slice workflow, worker coordination
- **5 SVG diagrams**: Visual architecture diagrams for agent management and fast processing

Updated MCP documentation with Claude Code settings examples and query best practices.

### Signal Handlers (swf-common-lib)

BaseAgent now includes signal handlers for SIGTERM and SIGINT, enabling cleaner shutdown behavior when agents are terminated by supervisord or manually.

---

## v28 (2026-01-13)

### ActiveMQ Destination Prefix Requirement (Breaking Change)

**All ActiveMQ destinations now require explicit `/queue/` or `/topic/` prefix.** This is a breaking change that affects all agent code sending messages.

**Before (incorrect):**
```python
self.send_message('epictopic', message)  # WRONG - bare name
```

**After (correct):**
```python
self.send_message('/topic/epictopic', message)  # Correct - explicit prefix
```

**Why this matters:**
- Bare destination names were ambiguous - ActiveMQ behavior depends on broker configuration
- Explicit prefixes make the routing intention clear: `/queue/` for anycast (one consumer) vs `/topic/` for multicast (all consumers)
- BaseAgent now validates destination format and raises `ValueError` for bare names

Existing code using bare names will fail immediately with a clear error message explaining the required format. All example agents and workflow code have been updated.
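
A check of this kind is short. The sketch below shows one way the validation could be written; it is not the BaseAgent source, and the function name and error wording are assumptions.

```python
VALID_PREFIXES = ("/queue/", "/topic/")


def validate_destination(destination: str) -> str:
    """Return the destination unchanged, or raise ValueError for a bare name."""
    for prefix in VALID_PREFIXES:
        # Require the prefix and a non-empty name after it.
        if destination.startswith(prefix) and len(destination) > len(prefix):
            return destination
    raise ValueError(
        f"Invalid destination {destination!r}: use an explicit "
        "'/queue/<name>' (one consumer) or '/topic/<name>' (all consumers) prefix"
    )
```

Failing at send time with the required format in the message means a bare name left over from older code is caught on first use rather than being routed by broker defaults.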

### MCP Workflow Control - AI-Driven Operations

The MCP service now provides **full workflow control**, enabling AI assistants to start, stop, and monitor workflows without requiring CLI access. This is the key enabler for AI-driven testbed operations.

**New workflow control tools:**
- `start_workflow` - Start a workflow by sending a command to the DAQ Simulator agent. All parameters are optional; defaults are read from the user's `testbed.toml`. Override specific parameters (e.g., `stf_count=5`) while inheriting others from config.
- `stop_workflow` - Stop a running workflow gracefully by `execution_id`. The workflow stops at the next checkpoint.
- `end_execution` - Mark a stuck execution as terminated in the database. Use this to clean up stale executions that the agent can no longer reach.

**New agent management tools:**
- `kill_agent` - Send SIGKILL to an agent process by instance name. Looks up the agent's PID and hostname, kills if on the same host, and always marks the agent as EXITED in the database.

**New monitoring tools:**
- `get_workflow_monitor` - Aggregated view of workflow execution: status, phase, STF count, key events, and errors (from both messages and logs). Single-call alternative to polling multiple tools.
- `list_workflow_monitors` - List recent executions (last 24h) that can be monitored.

The MCP tool count has grown from 20 to **27 tools**. Documentation in `swf-monitor/docs/MCP.md` has been updated to reflect all tools.
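
The "override some parameters, inherit the rest" behavior of `start_workflow` amounts to a dictionary merge. The sketch below is illustrative only: `resolve_workflow_params` and the `workflow_name` key are hypothetical, and the real tool reads its defaults from `testbed.toml` itself.

```python
def resolve_workflow_params(config_defaults: dict, **overrides) -> dict:
    """Merge explicit overrides over config defaults, ignoring unset (None) values."""
    params = dict(config_defaults)  # copy so the defaults are not mutated
    params.update({key: value for key, value in overrides.items() if value is not None})
    return params


# Hypothetical defaults, standing in for values read from testbed.toml.
defaults = {"workflow_name": "stf_datataking", "stf_count": 10}
params = resolve_workflow_params(defaults, stf_count=5)
```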

### User Agent Manager - Per-User Testbed Control via MCP

A new **agent manager daemon** enables MCP-driven control of per-user testbed agents. This allows AI assistants to start and stop a user's testbed without requiring SSH or terminal access.

**Architecture:**
- Each user runs a lightweight `testbed agent-manager` daemon in their swf-testbed directory
- The daemon listens on a user-specific queue (`/queue/agent_control.<username>`) for commands
- It manages supervisord-controlled agents and reports status via heartbeats

**New MCP tools:**
- `check_agent_manager(username)` - Check if a user's agent manager is alive. Returns heartbeat status, control queue name, and whether agents are running.
- `start_user_testbed(username, config_name)` - Send start command to agent manager. Agents start asynchronously.
- `stop_user_testbed(username)` - Send stop command to agent manager.

**Usage:**
```bash
# Start the agent manager daemon (run once, keeps running)
cd /data/<username>/github/swf-testbed
source .venv/bin/activate && source ~/.env
testbed agent-manager
```

Then an AI assistant can:
1. Check readiness: `check_agent_manager(username='wenauseic')`
2. Start testbed: `start_user_testbed(username='wenauseic')`
3. Run workflows: `start_workflow()`
4. Stop when done: `stop_user_testbed(username='wenauseic')`
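
For reference, the per-user control queue named in the architecture notes above can be derived from the username as sketched here. `agent_control_queue` is a hypothetical helper, and the username character check is an assumption rather than documented behavior.

```python
import re


def agent_control_queue(username: str) -> str:
    """Return the per-user control queue, e.g. '/queue/agent_control.alice'."""
    # Assumption: usernames are simple account names (letters, digits, '_', '-', '.').
    if not re.fullmatch(r"[A-Za-z0-9_.-]+", username):
        raise ValueError(f"Unexpected username: {username!r}")
    return f"/queue/agent_control.{username}"
```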

### Persistent WorkflowRunner with Message-Driven Execution

The WorkflowRunner agent has been redesigned as a **persistent, message-driven service** rather than a one-shot script.

**Key changes:**
- WorkflowRunner now starts with supervisord and listens on `/queue/workflow_control` for commands
- Commands include `run_workflow` (from MCP `start_workflow`) and `stop_workflow`
- Each execution gets a unique `execution_id` (e.g., `stf_datataking-wenauseic-0044`)
- The `stop_workflow` command targets a specific execution by ID, enabling graceful termination

**Why this matters:**
- The WorkflowRunner is always ready to receive workflow commands - it doesn't need to be started for each run
- This models the actual ePIC system more realistically, where the DAQ system is a persistent service
- Workflows can be started and stopped via MCP without CLI access
- Multiple workflows can be managed by `execution_id`
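
The example ID above suggests a `<workflow>-<username>-<4-digit sequence>` layout. That layout is inferred from the single example and is not a documented format; under that assumption, building and parsing IDs could look like this (both helper names are hypothetical):

```python
def format_execution_id(workflow: str, username: str, sequence: int) -> str:
    """Build an ID such as 'stf_datataking-wenauseic-0044'."""
    return f"{workflow}-{username}-{sequence:04d}"


def parse_execution_id(execution_id: str) -> tuple[str, str, int]:
    """Split an ID into (workflow, username, sequence).

    Splits from the right, so a workflow name may contain '-'; a username
    containing '-' would be parsed incorrectly under this assumed layout.
    """
    workflow, username, sequence = execution_id.rsplit("-", 2)
    return workflow, username, int(sequence)
```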

### Enhanced get_system_state - User Context and Readiness

The `get_system_state` MCP tool now accepts a `username` parameter and provides user-specific context.

**New fields returned:**
- `user_context` - Namespace and workflow defaults from user's `testbed.toml`
- `agent_manager` - Status of user's agent manager daemon (healthy/unhealthy/missing/exited)
- `workflow_runner` - Status of DAQ Simulator in user's namespace
- `ready_to_run` - Boolean indicating if the user can start a workflow
- `last_execution` - Most recent workflow execution in user's namespace
- `errors_last_hour` - Count of ERROR logs in user's namespace

This enables AI assistants to answer questions like "Am I ready to run a workflow?" with a single call.
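
A client might turn such a response into a one-line answer as sketched below. The response shape is an assumption: `agent_manager` and `workflow_runner` are treated as plain status strings, which the real tool may not return, and `describe_readiness` is a hypothetical helper.

```python
def describe_readiness(state: dict) -> str:
    """Summarize a get_system_state-style response in one line."""
    if state.get("ready_to_run"):
        return "ready to run"
    problems = []
    for component in ("agent_manager", "workflow_runner"):
        # Assumption: each component is reported as a plain status string.
        status = state.get(component, "missing")
        if status != "healthy":
            problems.append(f"{component} is {status}")
    return "not ready: " + ("; ".join(problems) or "no cause reported")
```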

### EXITED Status and Agent Lifecycle

Improved agent lifecycle management with explicit EXITED status handling.

**Changes:**
- Agents now set `status='EXITED'` and `operational_state='EXITED'` on clean shutdown
- `list_agents` **excludes EXITED agents by default** - use `status='EXITED'` to see only exited, or `status='all'` to see all
- `kill_agent` always marks agents as EXITED, even if the kill fails
- EXITED agents don't clutter the active agent list but remain queryable for debugging

**Migration:** A database migration (`0014_systemagent_exited_status.py`) adds the EXITED choice to the status field.
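
The `list_agents` filtering rules above can be expressed as a small function. This is a sketch over plain dictionaries, not the monitor's actual query code; `filter_agents` and the `OK` status value are hypothetical.

```python
def filter_agents(agents: list, status=None) -> list:
    """Apply list_agents-style status filtering to a list of agent dicts."""
    if status is None:
        # Default: hide agents that have exited.
        return [agent for agent in agents if agent["status"] != "EXITED"]
    if status == "all":
        return list(agents)
    return [agent for agent in agents if agent["status"] == status]


agents = [
    {"name": "daq-sim", "status": "OK"},        # 'OK' is a placeholder status
    {"name": "old-agent", "status": "EXITED"},
]
```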

### Logging Context with execution_id

Improved log traceability with execution context in log records.

**Changes:**
- New `_log_extra()` helper in BaseAgent returns consistent extra fields: `username`, `execution_id`, `run_id`
- All agent log calls should use: `logger.info("message", extra=self._log_extra())`
- `list_logs` MCP tool now supports `execution_id` parameter to filter logs by workflow execution

**Usage:**
```python
# In agent code
self.logger.info("Processing STF", extra=self._log_extra())

# Via MCP
list_logs(execution_id='stf_datataking-wenauseic-0044')
```

This enables tracing all log messages for a specific workflow execution, essential for debugging workflow failures.
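
To show how `extra` fields travel with a log record, here is a self-contained sketch using only the standard `logging` module. `DemoAgent` and `ListHandler` are hypothetical stand-ins; only the three field names come from the notes above.

```python
import logging


class ListHandler(logging.Handler):
    """Collects records in memory so the attached context can be inspected."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


class DemoAgent:
    """Stand-in for an agent; not the real BaseAgent."""

    def __init__(self, username, execution_id=None, run_id=None):
        self.username = username
        self.execution_id = execution_id
        self.run_id = run_id
        self.logger = logging.getLogger(f"demo_agent.{username}")
        self.logger.setLevel(logging.INFO)
        self.logger.propagate = False  # keep demo output out of the root logger

    def _log_extra(self):
        # Keys passed via `extra` become attributes on the LogRecord.
        return {
            "username": self.username,
            "execution_id": self.execution_id,
            "run_id": self.run_id,
        }
```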

### Monitor UI Improvements

**Log detail page:** The log detail view (`/logs/<id>/`) now displays the `extra_data` JSON field when present. This shows execution context (execution_id, run_id, namespace, username) that agents include via `_log_extra()`. Previously this context was captured but not visible in the UI.

**Log list filtering:** The log list now supports filtering by execution_id, complementing the existing app_name, instance_name, and level filters.

### Documentation Updates

- **MCP.md** completely rewritten to document all 27 tools with accurate parameters and return values
- Removed "Not Yet Implemented" section - all documented tools are now functional
- Added sections for Workflow Control, Agent Management, User Agent Manager, and Workflow Monitoring
- Updated tool count from 20 to 27

---

## v27 (2026-01-08)

### MCP Integration

The swf-monitor now exposes a **Model Context Protocol (MCP)** API, enabling AI assistants like Claude to query and interact with the testbed system.

**20+ MCP tools** for:
- **System state**: `get_system_state`, `list_agents`, `get_agent`, `list_namespaces`
- **Workflows**: `list_workflow_definitions`, `list_workflow_executions`, `get_workflow_execution`
- **Data**: `list_runs`, `get_run`, `list_stf_files`, `get_stf_file`, `list_tf_slices`
- **Messages & Logs**: `list_messages`, `list_logs`, `get_log_entry`

**Auto-discovery**: Add `.mcp.json` to your project root for Claude Code to automatically connect:
```json
{
  "mcpServers": {
    "swf-testbed": {
      "type": "sse",
      "url": "https://pandaserver02.sdcc.bnl.gov/swf-monitor/mcp/"
    }
  }
}
```

**Endpoint**: `https://pandaserver02.sdcc.bnl.gov/swf-monitor/mcp/`

### Agent Lifecycle Management

Agents now report process information for lifecycle management:
- **pid**: Process ID for kill operations
- **hostname**: Host where agent is running
- **operational_state**: STARTING → READY → PROCESSING → EXITED

These fields enable future orchestration features like agent health monitoring and remote termination.
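
As an illustration of the three fields above, a process can gather them with the standard library alone. The payload layout and the `build_heartbeat` name are assumptions for this sketch, not the agents' actual heartbeat format.

```python
import os
import socket

# State names taken from the list above.
OPERATIONAL_STATES = ("STARTING", "READY", "PROCESSING", "EXITED")


def build_heartbeat(operational_state: str) -> dict:
    """Collect the process information an agent could report."""
    if operational_state not in OPERATIONAL_STATES:
        raise ValueError(f"Unknown operational_state: {operational_state!r}")
    return {
        "pid": os.getpid(),                # used for kill operations
        "hostname": socket.gethostname(),  # host where the agent is running
        "operational_state": operational_state,
    }
```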

### Database Logging

New `DbLogHandler` sends Python log records to the monitor database, enabling centralized log viewing:
- View logs via monitor UI at `/logs/`
- Filter by app, instance, level, time range
- Query via MCP: `list_logs(level='ERROR')`, `get_log_entry(log_id)`
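
The general shape of such a handler is a `logging.Handler` subclass that turns each record into a row and hands it to some sink. The sketch below is not `DbLogHandler` itself: `ForwardingLogHandler`, the `sink` callable, and the row keys are hypothetical, and a list stands in for the database.

```python
import logging


class ForwardingLogHandler(logging.Handler):
    """Converts each log record into a dict and passes it to a sink callable."""

    def __init__(self, sink, app_name):
        super().__init__()
        self.sink = sink
        self.app_name = app_name

    def emit(self, record):
        try:
            self.sink({
                "app_name": self.app_name,
                "level": record.levelname,
                "message": record.getMessage(),
                "created": record.created,  # epoch seconds
            })
        except Exception:
            # Logging must never crash the application; defer to logging's own hook.
            self.handleError(record)


rows = []  # stand-in for the monitor database
log = logging.getLogger("forwarding_demo")
log.setLevel(logging.INFO)
log.propagate = False
log.addHandler(ForwardingLogHandler(rows.append, app_name="demo"))
log.error("disk full on %s", "node1")
```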

### BaseAgent Improvements

- Agents report EXITED status on shutdown
- Warning logged when sending messages without namespace set
- Heartbeats include pid, hostname, operational_state

---

## v26 (2025-12-31)

### Namespaces

Workflows now operate within **namespaces**, allowing users to isolate their work from others sharing the same infrastructure.

On shared systems like pandaserver02, multiple users can run workflows simultaneously. Namespaces let you filter the monitor UI to see only your workflows, agents, and messages, and avoid conflicts with other users.

Configure your namespace in `workflows/testbed.toml` before running any workflows:

```toml
[testbed]
namespace = "your-namespace"  # e.g., "alice-dev", "team-fastmon"
```

All workflow messages now include the namespace, and the monitor UI provides namespace filtering on agents, executions, and messages.

### Monitor UI

- **Namespace pages**: List and detail views; namespace column and filter on agents, executions, messages
- **Agent list**: Type and status filters; click agent to see detail
- **Agent detail**: Streamlined view linking to filtered workflow messages
- **Workflow messages**: execution_id and run_id filters; STF count column; click for message detail
- **Message detail**: Full message content view
- **Drill-down links**: Click execution_id, run_id, namespace, or agent anywhere to navigate to details
- **Source links**: GitHub links on workflow definition (branch) and execution (commit) pages

### Workflow Refinements

**Count-based workflow completion:** Workflows can now run until a specific number of STF files are generated, rather than requiring a duration limit:

```bash
python workflows/workflow_simulator.py stf_datataking \
    --workflow-config fast_processing_default \
    --stf-count 10
```
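
Conceptually, count-based completion adds a second stop condition next to the duration limit. The loop below is a hedged sketch of that idea, not the logic in `workflow_simulator.py`; `produce_stfs`, its parameters, and the generated file names are all hypothetical.

```python
import time


def produce_stfs(stf_count=None, duration_s=None, clock=time.monotonic):
    """Generate placeholder STF names until a count or a duration limit is hit."""
    if stf_count is None and duration_s is None:
        raise ValueError("Provide stf_count, duration_s, or both")
    start = clock()
    produced = []
    while True:
        if stf_count is not None and len(produced) >= stf_count:
            break  # count-based completion
        if duration_s is not None and clock() - start >= duration_s:
            break  # duration-based completion
        produced.append(f"stf_{len(produced):04d}")
    return produced
```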

**Immutable definitions:** Workflow definitions are now immutable once created. The definition captures the source code and configuration at creation time. Each execution records its specific git version for reproducibility.

**Source traceability:** Workflow definitions now link to their source script on GitHub. Executions record the exact git commit, so you can always trace back to the code that ran.

### Fast Processing Support

New infrastructure for fast processing workflows that sample STF data for near real-time monitoring:

- **Fast processing agent** (`example_agents/fast_processing_agent.py`) creates TF slices from STF samples
- Configurable sampling rate, slices per sample, and processing time
- Agents can start mid-run and extract context from messages
- New monitor views: TF Slices (`/tf-slices/`) and Run States (`/run-states/`)

### Agent Improvements

- Agents now register using the workflow name as their type (e.g., `STF_Datataking` instead of generic `workflow_runner`)
- Retry logic for initial ActiveMQ connection improves reliability on startup
- Agent list in monitor now supports type and status filters
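
The startup retry mentioned above follows a common pattern, sketched here in generic form. It is not the agents' actual code: `connect_with_retry`, its defaults, and the linear backoff are assumptions, and `connect` stands for whatever callable opens the broker connection.

```python
import time


def connect_with_retry(connect, attempts=5, delay_s=1.0, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts are used up."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError as error:
            last_error = error
            if attempt < attempts:
                # Linear backoff: wait a little longer after each failure.
                sleep(delay_s * attempt)
    raise ConnectionError(f"Gave up after {attempts} attempts") from last_error
```

Passing `sleep` as a parameter keeps the function easy to test without real delays.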

### Infrastructure

- Docker-compose updated with Redis and health checks
- Artemis queue configuration guide added (`docs/artemis-queue-configuration.md`)
- Fixed environment loading that was breaking git commands when `~/.env` contained PATH references

---

*For detailed technical changes, see the pull requests for [swf-testbed](https://github.com/BNLNPPS/swf-testbed/pulls), [swf-common-lib](https://github.com/BNLNPPS/swf-common-lib/pulls), and [swf-monitor](https://github.com/BNLNPPS/swf-monitor/pulls).*