# Agent Management

This document describes how workflow agents are started, stopped, and managed in the ePIC Streaming Workflow Testbed.

## Overview

Agents are managed through two control paths:
- **CLI** (`testbed` command) - for local operation
- **MCP** (Model Context Protocol) - for AI-assisted remote operation

Both paths use supervisord for process management, ensuring consistent behavior.

![Agent Management Overview](images/agent-management-overview-v4.svg)

## CLI Control Path

### Starting Agents and Workflows

```bash
testbed run                     # Uses workflows/testbed.toml
testbed run fast_processing     # Uses workflows/fast_processing_default.toml
```

**Startup sequence:**

```mermaid
sequenceDiagram
    participant User
    participant CLI as testbed CLI
    participant Orch as orchestrator.py
    participant Supv as supervisord
    participant Agents

    User->>CLI: testbed run
    CLI->>Orch: run(config_name)
    Orch->>Orch: Load config (testbed.toml)
    Orch->>Supv: Check running agents

    alt Agents already running
        Supv-->>Orch: Running agents list
        Orch-->>CLI: Error: agents running
        CLI-->>User: "Run testbed stop-agents first"
    else No agents running
        Orch->>Supv: Restart supervisord
        Note over Supv: Picks up fresh env vars
        Orch->>Supv: Start workflow-runner
        Orch->>Supv: Start enabled agents
        Supv->>Agents: Launch processes
        Orch->>Agents: Send run_workflow command
        Orch-->>CLI: Success
        CLI-->>User: "Workflow triggered"
    end
```
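
The branching in the startup sequence can be sketched in Python. This is an illustration only, not the code in `orchestrator.py`: the `FakeSupervisor` class and the signature of `run` are hypothetical, and the supervisor is an in-memory stand-in that records calls instead of talking to supervisord.

```python
from dataclasses import dataclass, field


@dataclass
class FakeSupervisor:
    """Hypothetical stand-in for supervisord; records calls in memory."""
    running: list[str] = field(default_factory=list)
    calls: list[str] = field(default_factory=list)

    def running_agents(self) -> list[str]:
        return list(self.running)

    def restart(self) -> None:
        self.calls.append("restart")

    def start(self, program: str) -> None:
        self.calls.append(f"start {program}")
        self.running.append(program)


def run(supervisor: FakeSupervisor, enabled_agents: list[str]) -> str:
    """Follow the diagram: refuse if agents run, else restart and start."""
    if supervisor.running_agents():
        return "Run testbed stop-agents first"
    supervisor.restart()                  # picks up fresh env vars
    supervisor.start("workflow-runner")
    for agent in enabled_agents:
        supervisor.start(agent)
    # The real orchestrator would send the run_workflow command here.
    return "Workflow triggered"
```

Calling `run` twice on the same `FakeSupervisor` shows both branches: the first call starts everything, and the second is refused because programs are already running.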

### Stopping Agents

```bash
testbed stop-agents             # Stop all workflow agents
```

This command:
1. Connects to supervisord using `agents.supervisord.conf`
2. Issues `supervisorctl stop all`
3. Terminates all agent processes

**Important:** `testbed stop-agents` uses `agents.supervisord.conf` (for workflow agents), not `supervisord.conf` (which manages web services).
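
As a rough sketch of the steps above, the same stop can be issued from Python through `subprocess`. The function names are hypothetical and this is not the testbed's own implementation; it assumes `supervisorctl` is on `PATH` and that the config path resolves from the current directory.

```python
import subprocess

AGENTS_CONF = "agents.supervisord.conf"  # config file named in this document


def stop_all_command(conf_path: str = AGENTS_CONF) -> list[str]:
    """Build the supervisorctl invocation that stops every managed program."""
    return ["supervisorctl", "-c", conf_path, "stop", "all"]


def stop_agents(conf_path: str = AGENTS_CONF) -> int:
    """Run the stop command and return supervisorctl's exit code."""
    result = subprocess.run(stop_all_command(conf_path), check=False)
    return result.returncode
```

Keeping the command builder separate from the call that executes it makes it easy to confirm that the agents config, not `supervisord.conf`, is the one being targeted.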

## MCP Control Path

### Agent Manager Daemon

The Agent Manager is a per-user daemon that bridges MCP commands to local supervisord:

```bash
testbed agent-manager           # Run in foreground
nohup testbed agent-manager &   # Run in background
```

The daemon:
- Listens on `/queue/testbed.{username}.control`
- Sends heartbeats to the monitor
- Executes start/stop commands via supervisorctl

```mermaid
flowchart LR
    subgraph Django["Django Monitor"]
        MCP["MCP Tool"]
    end

    subgraph AMQ["ActiveMQ"]
        CQ["Control Queue<br/>/queue/testbed.{user}.control"]
    end

    subgraph Local["User's Session"]
        AM["Agent Manager"]
        SUPV["supervisord"]
    end

    MCP -->|"send command"| CQ
    CQ -->|"receive"| AM
    AM -->|"supervisorctl"| SUPV
    AM -.->|"heartbeat"| Django
```
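
The bridge from a control message to a supervisorctl call might look like the sketch below. The queue name follows the pattern given above; the JSON message shape, the command names `start_testbed` and `stop_testbed`, and both function names are assumptions made for illustration, since this document does not specify the wire format.

```python
import json


def control_queue(username: str) -> str:
    """Per-user control queue, following the pattern in this document."""
    return f"/queue/testbed.{username}.control"


def supervisorctl_args(message_body: str) -> list[str]:
    """Map a control message to supervisorctl arguments.

    The JSON layout and command names are assumed, not taken from the testbed.
    """
    command = json.loads(message_body).get("command")
    actions = {
        "start_testbed": ["start", "all"],
        "stop_testbed": ["stop", "all"],
    }
    if command not in actions:
        raise ValueError(f"unsupported command: {command!r}")
    return ["supervisorctl", "-c", "agents.supervisord.conf", *actions[command]]
```

Rejecting unknown commands explicitly keeps a malformed message on the queue from being turned into an arbitrary supervisorctl call.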

### MCP Tools

```python
# Check agent manager status
check_agent_manager(username)

# Start testbed (agents + workflow runner)
start_user_testbed(username, config_name="testbed.toml")

# Stop all agents
stop_user_testbed(username)

# Start a workflow (after agents are running)
start_workflow(namespace="torre2", stf_count=10)

# Stop a running workflow
stop_workflow(execution_id="stf_datataking-wenauseic-0049")

# Comprehensive status
get_testbed_status(username)
```

### MCP Startup Sequence

```mermaid
sequenceDiagram
    participant MCP as MCP Tool
    participant AMQ as ActiveMQ
    participant AM as Agent Manager
    participant Supv as supervisord
    participant Agents

    MCP->>AMQ: start_testbed command
    AMQ->>AM: Receive command
    AM->>AM: Load config
    AM->>Supv: Check running agents

    alt Agents already running
        AM-->>MCP: Error: agents running
    else No agents running
        AM->>Supv: Restart supervisord
        AM->>Supv: Start all agents
        Supv->>Agents: Launch processes
        AM-->>MCP: Success
    end
```

## Configuration

### testbed.toml Structure

```toml
[testbed]
namespace = "torre2"              # Isolation namespace

[agents.data]
enabled = true                    # Enable this agent
script = "example_agents/example_data_agent.py"

[agents.fastmon]
enabled = true

[agents.fast_processing]
enabled = true

[workflow]
name = "stf_datataking"           # Default workflow
config = "fast_processing_default"
realtime = true
```

### Environment Variables

Supervisord passes environment variables to agents:

```ini
# agents.supervisord.conf
[program:example-data-agent]
command=python example_agents/example_data_agent.py
directory=%(ENV_SWF_HOME)s/swf-testbed
environment=SWF_TESTBED_CONFIG="%(ENV_SWF_TESTBED_CONFIG)s"
autostart=false
autorestart=true
```

Key variables:

| Variable | Purpose |
|----------|---------|
| `SWF_HOME` | Parent directory containing swf-* repos |
| `SWF_TESTBED_CONFIG` | Path to testbed.toml |
| `SWF_MONITOR_HTTP_URL` | Monitor REST API URL |
| `SWF_API_TOKEN` | API authentication token |

**Important:** Supervisord must be restarted to pick up environment variable changes. Both CLI and MCP paths automatically restart supervisord on start.
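
Because supervisord captures its environment at startup, it can help to check the variables before starting it. The sketch below is one way to do that; the function name is hypothetical, and treating all four variables in the table as required is an assumption.

```python
import os

# Variable names from the table above; "required" is an assumption.
REQUIRED_VARS = (
    "SWF_HOME",
    "SWF_TESTBED_CONFIG",
    "SWF_MONITOR_HTTP_URL",
    "SWF_API_TOKEN",
)


def missing_vars(env=None) -> list[str]:
    """Return the listed variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Set before starting supervisord:", ", ".join(missing))
```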

## Process Lifecycle

```mermaid
stateDiagram-v2
    [*] --> Stopped: Initial

    Stopped --> Starting: testbed run /<br/>start_user_testbed
    Starting --> Running: Agents started
    Running --> Stopping: testbed stop-agents /<br/>stop_user_testbed
    Stopping --> Stopped: Agents stopped

    Running --> Running: start_workflow /<br/>stop_workflow

    note right of Starting
        1. Check no agents running
        2. Restart supervisord
        3. Start enabled agents
    end note
```
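
The lifecycle diagram can be read as a small transition table. The sketch below encodes it that way; the event names (`start`, `agents_started`, and so on) are labels invented for this example and are not identifiers from the testbed code.

```python
from enum import Enum


class State(Enum):
    STOPPED = "Stopped"
    STARTING = "Starting"
    RUNNING = "Running"
    STOPPING = "Stopping"


# (current state, event) -> next state, one entry per edge in the diagram.
TRANSITIONS = {
    (State.STOPPED, "start"): State.STARTING,           # testbed run / start_user_testbed
    (State.STARTING, "agents_started"): State.RUNNING,
    (State.RUNNING, "start_workflow"): State.RUNNING,
    (State.RUNNING, "stop_workflow"): State.RUNNING,
    (State.RUNNING, "stop"): State.STOPPING,            # testbed stop-agents / stop_user_testbed
    (State.STOPPING, "agents_stopped"): State.STOPPED,
}


def step(state: State, event: str) -> State:
    """Return the next state, or raise if the diagram has no such edge."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not allowed in state {state.value}") from None
```

Under this reading, starting while already `Running` has no edge, which matches the "agents already running" error both control paths return.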

## Troubleshooting

### Agents Won't Start: "agents already running"

```bash
# Check what's running
supervisorctl -c agents.supervisord.conf status

# Stop existing agents
testbed stop-agents

# Now start fresh
testbed run
```
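
When scripting this check, the `status` output can be filtered for programs still in the `RUNNING` state. The sketch below assumes the usual whitespace-separated layout of `supervisorctl status` (name, state, details); the sample text, including the PID and uptime, is made up for illustration.

```python
def running_programs(status_output: str) -> list[str]:
    """Return names of programs that supervisorctl reports as RUNNING."""
    names = []
    for line in status_output.splitlines():
        fields = line.split()
        # Expected layout: <name> <STATE> <details...>
        if len(fields) >= 2 and fields[1] == "RUNNING":
            names.append(fields[0])
    return names


# Illustrative sample only; the PID and uptime are invented.
STATUS_SAMPLE = """\
example-data-agent               RUNNING   pid 4021, uptime 0:12:07
workflow-runner                  STOPPED   Not started
"""

print(running_programs(STATUS_SAMPLE))
```

An empty list means no agents are running and `testbed run` should not be refused on that account.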

### Wrong Namespace

Agents use the namespace from the config file that was active when supervisord started. If agents are using the wrong namespace:

```bash
testbed stop-agents
# Edit testbed.toml or set SWF_TESTBED_CONFIG
testbed run
```

`testbed run` restarts supervisord, which picks up the new configuration.

### Agent Manager Not Responding

```bash
# Check if agent manager is running
ps aux | grep user_agent_manager

# Check status via the MCP tool (not a shell command):
#   check_agent_manager(username)

# Restart agent manager
pkill -f user_agent_manager
nohup testbed agent-manager &
```

## See Also

- [Architecture Overview](architecture.md) - System design and components
- [Fast Processing Workflow](fast-processing-workflow.md) - Workflow sequence diagram
- [Operations Guide](operations.md) - Day-to-day operations