# Agent Management

This document describes how workflow agents are started, stopped, and managed in the ePIC Streaming Workflow Testbed.

## Overview

Agents are managed through two control paths:
- **CLI** (`testbed` command) - for local operation
- **MCP** (Model Context Protocol) - for AI-assisted remote operation

Both paths use supervisord for process management, ensuring consistent behavior.

![Agent Management Overview](images/agent-management-overview-v4.svg)

## CLI Control Path

### Starting Agents and Workflows

```bash
testbed run                     # Uses workflows/testbed.toml
testbed run fast_processing     # Uses workflows/fast_processing_default.toml
```

**Startup sequence:**

```mermaid
sequenceDiagram
    participant User
    participant CLI as testbed CLI
    participant Orch as orchestrator.py
    participant Supv as supervisord
    participant Agents

    User->>CLI: testbed run
    CLI->>Orch: run(config_name)
    Orch->>Orch: Load config (testbed.toml)
    Orch->>Supv: Check running agents

    alt Agents already running
        Supv-->>Orch: Running agents list
        Orch-->>CLI: Error: agents running
        CLI-->>User: "Run testbed stop-agents first"
    else No agents running
        Orch->>Supv: Restart supervisord
        Note over Supv: Picks up fresh env vars
        Orch->>Supv: Start workflow-runner
        Orch->>Supv: Start enabled agents
        Supv->>Agents: Launch processes
        Orch->>Agents: Send run_workflow command
        Orch-->>CLI: Success
        CLI-->>User: "Workflow triggered"
    end
```
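
The ordering in the startup sequence (check, restart, start the runner, start the enabled agents) can be sketched in Python. This is a minimal illustration, not the code in `orchestrator.py`: the function name `start_testbed`, the exception `AgentsAlreadyRunning`, and the injected callables are all hypothetical, and the final `run_workflow` message is left out.

```python
from typing import Callable, Iterable, List


class AgentsAlreadyRunning(RuntimeError):
    """Raised when a start is requested while agents are still running."""


def start_testbed(
    running_agents: Callable[[], Iterable[str]],
    restart_supervisord: Callable[[], None],
    start_program: Callable[[str], None],
    enabled_agents: Iterable[str],
) -> List[str]:
    """Apply the documented order: check, restart, start runner, start agents.

    All four parameters are hypothetical hooks standing in for the real
    supervisord interaction.
    """
    running = list(running_agents())
    if running:
        # Mirrors the "Agents already running" branch of the diagram.
        raise AgentsAlreadyRunning(
            f"agents running: {', '.join(running)}; run 'testbed stop-agents' first"
        )

    restart_supervisord()  # a restart is what picks up fresh environment variables
    started = ["workflow-runner"]
    start_program("workflow-runner")
    for name in enabled_agents:
        start_program(name)
        started.append(name)
    return started
```

Passing the supervisord operations in as callables keeps the ordering logic testable without a running supervisord.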

### Stopping Agents

```bash
testbed stop-agents             # Stop all workflow agents
```

This command:
1. Connects to supervisord using `agents.supervisord.conf`
2. Issues `supervisorctl stop all`
3. Terminates all agent processes

**Important:** `testbed stop-agents` uses `agents.supervisord.conf` (for workflow agents), not `supervisord.conf` (which manages web services).

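As a rough sketch of the steps above, the equivalent `supervisorctl` invocation can be built and run from Python. The helper names `stop_all_command` and `stop_agents` are hypothetical, and the relative config path assumes the current directory contains `agents.supervisord.conf`; the real CLI may resolve the path differently.

```python
import subprocess
from typing import List


def stop_all_command(conf_path: str = "agents.supervisord.conf") -> List[str]:
    """Build the argument list for 'supervisorctl -c <conf> stop all'."""
    return ["supervisorctl", "-c", conf_path, "stop", "all"]


def stop_agents(conf_path: str = "agents.supervisord.conf") -> int:
    """Run the stop command and return supervisorctl's exit code."""
    completed = subprocess.run(stop_all_command(conf_path), check=False)
    return completed.returncode
```

Passing the config file explicitly with `-c` is what keeps the command pointed at the agent supervisord rather than the one that manages web services.
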
## MCP Control Path

### Agent Manager Daemon

The Agent Manager is a per-user daemon that bridges MCP commands to local supervisord:

```bash
testbed agent-manager           # Run in foreground
nohup testbed agent-manager &   # Run in background
```

The daemon:
- Listens on `/queue/testbed.{username}.control`
- Sends heartbeats to the monitor
- Executes start/stop commands via supervisorctl

```mermaid
flowchart LR
    subgraph Django["Django Monitor"]
        MCP["MCP Tool"]
    end

    subgraph AMQ["ActiveMQ"]
        CQ["Control Queue<br/>/queue/testbed.{user}.control"]
    end

    subgraph Local["User's Session"]
        AM["Agent Manager"]
        SUPV["supervisord"]
    end

    MCP -->|"send command"| CQ
    CQ -->|"receive"| AM
    AM -->|"supervisorctl"| SUPV
    AM -.->|"heartbeat"| Django
```
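
Because the control queue is per user, its name can be derived from the username. The helper below is a hypothetical illustration of the `/queue/testbed.{username}.control` pattern. Falling back to the local login name is an assumption, not something the testbed is known to do.

```python
import getpass
from typing import Optional


def control_queue(username: Optional[str] = None) -> str:
    """Return the per-user control queue destination.

    Defaulting to the local login name is an assumption for this sketch.
    """
    user = username or getpass.getuser()
    return f"/queue/testbed.{user}.control"
```

For example, `control_queue("alice")` yields `/queue/testbed.alice.control`.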

### MCP Tools

```python
# Check agent manager status
check_agent_manager(username)

# Start testbed (agents + workflow runner)
start_user_testbed(username, config_name="testbed.toml")

# Stop all agents
stop_user_testbed(username)

# Start a workflow (after agents are running)
start_workflow(namespace="torre2", stf_count=10)

# Stop a running workflow
stop_workflow(execution_id="stf_datataking-wenauseic-0049")

# Comprehensive status
get_testbed_status(username)
```

### MCP Startup Sequence

```mermaid
sequenceDiagram
    participant MCP as MCP Tool
    participant AMQ as ActiveMQ
    participant AM as Agent Manager
    participant Supv as supervisord
    participant Agents

    MCP->>AMQ: start_testbed command
    AMQ->>AM: Receive command
    AM->>AM: Load config
    AM->>Supv: Check running agents

    alt Agents already running
        AM-->>MCP: Error: agents running
    else No agents running
        AM->>Supv: Restart supervisord
        AM->>Supv: Start all agents
        Supv->>Agents: Launch processes
        AM-->>MCP: Success
    end
```

## Configuration

### testbed.toml Structure

```toml
[testbed]
namespace = "torre2"              # Isolation namespace

[agents.data]
enabled = true                    # Enable this agent
script = "example_agents/example_data_agent.py"

[agents.fastmon]
enabled = true

[agents.fast_processing]
enabled = true

[workflow]
name = "stf_datataking"           # Default workflow
config = "fast_processing_default"
realtime = true
```

### Environment Variables

Supervisord passes environment variables to agents:

```ini
# agents.supervisord.conf
[program:example-data-agent]
command=python example_agents/example_data_agent.py
directory=%(ENV_SWF_HOME)s/swf-testbed
environment=SWF_TESTBED_CONFIG="%(ENV_SWF_TESTBED_CONFIG)s"
autostart=false
autorestart=true
```

Key variables:

| Variable | Purpose |
|----------|---------|
| `SWF_HOME` | Parent directory containing swf-* repos |
| `SWF_TESTBED_CONFIG` | Path to testbed.toml |
| `SWF_MONITOR_HTTP_URL` | Monitor REST API URL |
| `SWF_API_TOKEN` | API authentication token |

**Important:** Supervisord must be restarted to pick up environment variable changes. Both the CLI and MCP paths restart supervisord automatically on start.

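Since supervisord captures its environment at startup, it can help to check the variables before starting. The sketch below is a hypothetical pre-flight check. Treating all four variables from the table as required is an assumption, and `missing_variables` is not part of the testbed.

```python
import os
from typing import List, Mapping

# Assumption: all four variables from the table are needed before starting.
REQUIRED = (
    "SWF_HOME",
    "SWF_TESTBED_CONFIG",
    "SWF_MONITOR_HTTP_URL",
    "SWF_API_TOKEN",
)


def missing_variables(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or empty in env."""
    return [name for name in REQUIRED if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Unset before starting supervisord:", ", ".join(missing))
```
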
## Process Lifecycle

```mermaid
stateDiagram-v2
    [*] --> Stopped: Initial

    Stopped --> Starting: testbed run /<br/>start_user_testbed
    Starting --> Running: Agents started
    Running --> Stopping: testbed stop-agents /<br/>stop_user_testbed
    Stopping --> Stopped: Agents stopped

    Running --> Running: start_workflow /<br/>stop_workflow

    note right of Starting
        1. Check no agents running
        2. Restart supervisord
        3. Start enabled agents
    end note
```
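
The same lifecycle can be written as a small transition table, which makes the allowed moves explicit. This is only a sketch of the diagram. The event names (`start`, `agents_started`, `workflow`, `stop`, `agents_stopped`) are hypothetical labels rather than identifiers from the testbed.

```python
from enum import Enum


class State(Enum):
    STOPPED = "Stopped"
    STARTING = "Starting"
    RUNNING = "Running"
    STOPPING = "Stopping"


# (current state, event) -> next state. Event names are hypothetical labels.
TRANSITIONS = {
    (State.STOPPED, "start"): State.STARTING,          # testbed run / start_user_testbed
    (State.STARTING, "agents_started"): State.RUNNING,
    (State.RUNNING, "workflow"): State.RUNNING,        # start_workflow / stop_workflow
    (State.RUNNING, "stop"): State.STOPPING,           # testbed stop-agents / stop_user_testbed
    (State.STOPPING, "agents_stopped"): State.STOPPED,
}


def next_state(state: State, event: str) -> State:
    """Return the state reached from `state` on `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not allowed in state {state.value}"
        ) from None
```

Starting or stopping a workflow leaves the agents in `Running`, which is why `start_workflow` only makes sense after the agents are up.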

## Troubleshooting

### Agents Won't Start: "agents already running"

```bash
# Check what's running
supervisorctl -c agents.supervisord.conf status

# Stop existing agents
testbed stop-agents

# Now start fresh
testbed run
```
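
When the status check needs to be scripted, the `supervisorctl status` output can be filtered for programs in the `RUNNING` state. The sketch below assumes the usual column layout (program name, state, details). The sample text is illustrative rather than captured output, and `example-fastmon-agent` is a made-up program name.

```python
from typing import List

# Illustrative text in the usual 'supervisorctl status' layout,
# not output captured from a real testbed.
SAMPLE = """\
example-data-agent               RUNNING   pid 4242, uptime 0:05:10
example-fastmon-agent            STOPPED   Not started
workflow-runner                  RUNNING   pid 4250, uptime 0:05:08
"""


def running_programs(status_output: str) -> List[str]:
    """Return program names whose second column is RUNNING."""
    running = []
    for line in status_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "RUNNING":
            running.append(fields[0])
    return running


print(running_programs(SAMPLE))
```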

### Wrong Namespace

Agents use the namespace from the config file that was active when supervisord started. If agents are using the wrong namespace:

```bash
testbed stop-agents
# Edit testbed.toml or set SWF_TESTBED_CONFIG
testbed run
```

The restart of supervisord picks up the new configuration.

### Agent Manager Not Responding

```bash
# Check whether the agent manager is running
# (the [u] bracket keeps grep from matching its own process)
ps aux | grep '[u]ser_agent_manager'

# Check status through MCP (an MCP tool call, not a shell command):
#   check_agent_manager(username)

# Restart the agent manager
pkill -f user_agent_manager
nohup testbed agent-manager &
```

## See Also

- [Architecture Overview](architecture.md) - System design and components
- [Fast Processing Workflow](fast-processing-workflow.md) - Workflow sequence diagram
- [Operations Guide](operations.md) - Day-to-day operations