# Agent Management

This document describes how workflow agents are started, stopped, and managed in the ePIC Streaming Workflow Testbed.

## Overview

Agents are managed through two control paths:
- **CLI** (`testbed` command) - for local operation
- **MCP** (Model Context Protocol) - for AI-assisted remote operation

Both paths use supervisord for process management, ensuring consistent behavior.

## CLI Control Path

### Starting Agents and Workflows

```bash
testbed run                  # Uses workflows/testbed.toml
testbed run fast_processing  # Uses workflows/fast_processing_default.toml
```

**Startup sequence:**

```mermaid
sequenceDiagram
    participant User
    participant CLI as testbed CLI
    participant Orch as orchestrator.py
    participant Supv as supervisord
    participant Agents

    User->>CLI: testbed run
    CLI->>Orch: run(config_name)
    Orch->>Orch: Load config (testbed.toml)
    Orch->>Supv: Check running agents

    alt Agents already running
        Supv-->>Orch: Running agents list
        Orch-->>CLI: Error: agents running
        CLI-->>User: "Run testbed stop-agents first"
    else No agents running
        Orch->>Supv: Restart supervisord
        Note over Supv: Picks up fresh env vars
        Orch->>Supv: Start workflow-runner
        Orch->>Supv: Start enabled agents
        Supv->>Agents: Launch processes
        Orch->>Agents: Send run_workflow command
        Orch-->>CLI: Success
        CLI-->>User: "Workflow triggered"
    end
```
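The branching in the startup sequence can be sketched in a few lines of Python. This is a minimal illustration only, not the code in `orchestrator.py`: `FakeSupervisor`, `AgentsAlreadyRunning`, `run_testbed`, and the `send_command` callback are hypothetical names introduced here, and the in-memory supervisor stands in for a real supervisord connection.

```python
from dataclasses import dataclass, field


class AgentsAlreadyRunning(RuntimeError):
    """Raised when a start is requested while agents are still running."""


@dataclass
class FakeSupervisor:
    """In-memory stand-in for supervisord (hypothetical, illustration only)."""
    running: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def restart(self):
        self.log.append("restart")

    def start(self, name):
        self.log.append(f"start {name}")
        self.running.append(name)


def run_testbed(supervisor, enabled_agents, send_command):
    """Follow the diagram: refuse if agents are running, otherwise start fresh."""
    if supervisor.running:
        raise AgentsAlreadyRunning("Run testbed stop-agents first")
    supervisor.restart()                 # picks up fresh environment variables
    supervisor.start("workflow-runner")  # runner first, then the enabled agents
    for name in enabled_agents:
        supervisor.start(name)
    send_command("run_workflow")         # trigger the workflow once agents are up
    return "Workflow triggered"
```

Calling `run_testbed` a second time against the same `FakeSupervisor` raises `AgentsAlreadyRunning`, which corresponds to the "Agents already running" branch of the diagram.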

### Stopping Agents

```bash
testbed stop-agents  # Stop all workflow agents
```

This command:
1. Connects to supervisord using `agents.supervisord.conf`
2. Issues `supervisorctl stop all`
3. Terminates all agent processes

**Important:** `testbed stop-agents` uses `agents.supervisord.conf` (for workflow agents), not `supervisord.conf` (which manages web services).
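As a rough sketch of the steps above, the stop operation amounts to running `supervisorctl` against the agents configuration file. The function names `stop_all_command` and `stop_agents` are hypothetical, and shelling out with `subprocess` is an assumption; the actual CLI may talk to supervisord differently.

```python
import subprocess

# The workflow agents config, not supervisord.conf (which manages web services).
AGENTS_CONF = "agents.supervisord.conf"


def stop_all_command(conf_path=AGENTS_CONF):
    """Build the supervisorctl invocation that stops every managed program."""
    return ["supervisorctl", "-c", conf_path, "stop", "all"]


def stop_agents(conf_path=AGENTS_CONF, runner=subprocess.run):
    """Run the stop command and return the exit code of supervisorctl."""
    result = runner(stop_all_command(conf_path), check=False)
    return result.returncode
```

Passing the runner as a parameter keeps the sketch testable without a live supervisord: a stub can record the command instead of executing it.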

## MCP Control Path

### Agent Manager Daemon

The Agent Manager is a per-user daemon that bridges MCP commands to local supervisord:

```bash
testbed agent-manager          # Run in foreground
nohup testbed agent-manager &  # Run in background
```

The daemon:
- Listens on `/queue/testbed.{username}.control`
- Sends heartbeats to the monitor
- Executes start/stop commands via supervisorctl

```mermaid
flowchart LR
    subgraph Django["Django Monitor"]
        MCP["MCP Tool"]
    end

    subgraph AMQ["ActiveMQ"]
        CQ["Control Queue<br/>/queue/testbed.{user}.control"]
    end

    subgraph Local["User's Session"]
        AM["Agent Manager"]
        SUPV["supervisord"]
    end

    MCP -->|"send command"| CQ
    CQ -->|"receive"| AM
    AM -->|"supervisorctl"| SUPV
    AM -.->|"heartbeat"| Django
```
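The bridging role of the daemon can be illustrated with a small sketch that derives the per-user queue name and maps a control message to `supervisorctl` arguments. The JSON message shape and the `stop_testbed` command name are assumptions made for this example (`start_testbed` appears in the MCP startup sequence), and `control_queue` and `supervisorctl_args` are hypothetical helpers, not the Agent Manager's actual API.

```python
import json

# Assumed mapping from control commands to supervisorctl actions.
COMMAND_ACTIONS = {"start_testbed": "start", "stop_testbed": "stop"}


def control_queue(username):
    """Return the per-user control queue the daemon listens on."""
    return f"/queue/testbed.{username}.control"


def supervisorctl_args(message_body, conf_path="agents.supervisord.conf"):
    """Translate a JSON control message into a supervisorctl argument list."""
    command = json.loads(message_body).get("command")
    action = COMMAND_ACTIONS.get(command)
    if action is None:
        raise ValueError(f"unsupported control command: {command!r}")
    return ["supervisorctl", "-c", conf_path, action, "all"]
```

Rejecting unknown commands explicitly, rather than passing them through to `supervisorctl`, keeps the queue from becoming a way to run arbitrary supervisor actions.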

### MCP Tools

```python
# Check agent manager status
check_agent_manager(username)

# Start testbed (agents + workflow runner)
start_user_testbed(username, config_name="testbed.toml")

# Stop all agents
stop_user_testbed(username)

# Start a workflow (after agents are running)
start_workflow(namespace="torre2", stf_count=10)

# Stop a running workflow
stop_workflow(execution_id="stf_datataking-wenauseic-0049")

# Comprehensive status
get_testbed_status(username)
```

### MCP Startup Sequence

```mermaid
sequenceDiagram
    participant MCP as MCP Tool
    participant AMQ as ActiveMQ
    participant AM as Agent Manager
    participant Supv as supervisord
    participant Agents

    MCP->>AMQ: start_testbed command
    AMQ->>AM: Receive command
    AM->>AM: Load config
    AM->>Supv: Check running agents

    alt Agents already running
        AM-->>MCP: Error: agents running
    else No agents running
        AM->>Supv: Restart supervisord
        AM->>Supv: Start all agents
        Supv->>Agents: Launch processes
        AM-->>MCP: Success
    end
```

## Configuration

### testbed.toml Structure

```toml
[testbed]
namespace = "torre2"  # Isolation namespace

[agents.data]
enabled = true  # Enable this agent
script = "example_agents/example_data_agent.py"

[agents.fastmon]
enabled = true

[agents.fast_processing]
enabled = true

[workflow]
name = "stf_datataking"  # Default workflow
config = "fast_processing_default"
realtime = true
```

### Environment Variables

Supervisord passes environment variables to agents:

```ini
# agents.supervisord.conf
[program:example-data-agent]
command=python example_agents/example_data_agent.py
directory=%(ENV_SWF_HOME)s/swf-testbed
environment=SWF_TESTBED_CONFIG="%(ENV_SWF_TESTBED_CONFIG)s"
autostart=false
autorestart=true
```

Key variables:

| Variable | Purpose |
|----------|---------|
| `SWF_HOME` | Parent directory containing swf-* repos |
| `SWF_TESTBED_CONFIG` | Path to testbed.toml |
| `SWF_MONITOR_HTTP_URL` | Monitor REST API URL |
| `SWF_API_TOKEN` | API authentication token |

**Important:** Supervisord must be restarted to pick up environment variable changes. Both CLI and MCP paths automatically restart supervisord on start.
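Because supervisord only sees the environment it was started with, it can help to verify the variables before starting. The sketch below is one way to do that; the `missing_variables` helper is hypothetical, and treating all four variables in the table as required is an assumption.

```python
import os

# Variables from the table above; whether each is mandatory is an assumption.
KEY_VARIABLES = (
    "SWF_HOME",
    "SWF_TESTBED_CONFIG",
    "SWF_MONITOR_HTTP_URL",
    "SWF_API_TOKEN",
)


def missing_variables(env=None):
    """Return the key variables that are unset or empty in the environment."""
    env = os.environ if env is None else env
    return [name for name in KEY_VARIABLES if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Not set:", ", ".join(missing))
```

Accepting a mapping instead of reading `os.environ` directly makes the check easy to exercise with a plain dictionary.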

## Process Lifecycle

```mermaid
stateDiagram-v2
    [*] --> Stopped: Initial

    Stopped --> Starting: testbed run /<br/>start_user_testbed
    Starting --> Running: Agents started
    Running --> Stopping: testbed stop-agents /<br/>stop_user_testbed
    Stopping --> Stopped: Agents stopped

    Running --> Running: start_workflow /<br/>stop_workflow

    note right of Starting
        1. Check no agents running
        2. Restart supervisord
        3. Start enabled agents
    end note
```
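The lifecycle in the diagram can also be written as a transition table, which makes the invalid moves explicit. This is only a sketch: the state names come from the diagram, while the event names and the `next_state` function are assumptions introduced for illustration.

```python
# (current state, event) -> next state, following the state diagram.
TRANSITIONS = {
    ("Stopped", "start"): "Starting",          # testbed run / start_user_testbed
    ("Starting", "agents_started"): "Running",
    ("Running", "stop"): "Stopping",           # testbed stop-agents / stop_user_testbed
    ("Stopping", "agents_stopped"): "Stopped",
    ("Running", "start_workflow"): "Running",  # workflows do not change agent state
    ("Running", "stop_workflow"): "Running",
}


def next_state(state, event):
    """Return the state reached from `state` on `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state!r}") from None
```

For example, `next_state("Stopped", "start_workflow")` raises `ValueError`, reflecting that a workflow can only be started once the agents are running.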

## Troubleshooting

### Agents Won't Start: "agents already running"

```bash
# Check what's running
supervisorctl -c agents.supervisord.conf status

# Stop existing agents
testbed stop-agents

# Now start fresh
testbed run
```

### Wrong Namespace

Agents use the namespace from the config file that was active when supervisord started. If agents are using the wrong namespace:

```bash
testbed stop-agents
# Edit testbed.toml or set SWF_TESTBED_CONFIG
testbed run
```

Because `testbed run` restarts supervisord, the agents pick up the new configuration.

### Agent Manager Not Responding

```bash
# Check if agent manager is running
ps aux | grep user_agent_manager

# Check status via the MCP tool (not a shell command):
#   check_agent_manager(username)

# Restart agent manager
pkill -f user_agent_manager
nohup testbed agent-manager &
```

## See Also

- [Architecture Overview](architecture.md) - System design and components
- [Fast Processing Workflow](fast-processing-workflow.md) - Workflow sequence diagram
- [Operations Guide](operations.md) - Day-to-day operations