# MCP Stabilization Status

Date: 2026-04-27

Related plan: [MCP_STABILIZATION_PLAN.md](MCP_STABILIZATION_PLAN.md)

## Summary

The first stabilization pass has been implemented, committed, pushed, deployed,
and verified on swf-testbed.

Commit:

```text
ed87bd8 Stabilize local MCP service operation
```

The live PanDA bot and testbed bot now initialize MCP successfully through the
local loopback ASGI endpoint.

## Current Operating Model

MCP on swf-testbed operates as a local, stateless POST request/response service:

- Local endpoint: `http://127.0.0.1:8001/swf-monitor/mcp/`
- The public Apache path remains in place for routing, but long-lived GET/SSE
  streaming is not an operational dependency.
- `DJANGO_MCP_GLOBAL_SERVER_CONFIG["stateless"] = True`
- The ASGI service remains isolated from the main mod_wsgi Django site.

The supported surface in practical use is:

- `initialize`
- `tools/list`
- `tools/call`

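Each of the three supported methods travels as a single JSON-RPC 2.0 POST. The following is a minimal sketch of such a client, assuming the endpoint accepts the standard MCP Streamable HTTP request headers. `build_request`, `initialize_request`, and `post` are hypothetical helper names, and the `protocolVersion` and `clientInfo` values are illustrative rather than taken from the deployed bots.

```python
import json
import urllib.request

# Local endpoint as given in this document.
MCP_URL = "http://127.0.0.1:8001/swf-monitor/mcp/"


def build_request(method, params=None, request_id=1):
    """Build a JSON-RPC 2.0 request body for one stateless MCP POST."""
    body = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        body["params"] = params
    return body


def initialize_request(request_id=1):
    """Build an `initialize` request; the field values are illustrative assumptions."""
    return build_request(
        "initialize",
        {
            "protocolVersion": "2025-03-26",
            "capabilities": {},
            "clientInfo": {"name": "example-client", "version": "0.0.1"},
        },
        request_id,
    )


def post(body, url=MCP_URL, timeout=10.0):
    """Send one request and return the raw response text (plain JSON or SSE-framed)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For example, `post(initialize_request())` followed by `post(build_request("tools/list", request_id=2))` exercises two of the three methods. Depending on server configuration the response body may be plain JSON or an SSE-framed message, so a caller should be prepared to parse either.
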
## Implemented Changes

1. MCP stateless mode enabled.

   `src/swf_monitor_project/settings.py` now sets:

   ```python
   "stateless": True
   ```

2. ASGI worker bounded.

   `swf-monitor-mcp-asgi.service` now runs uvicorn with:

   ```text
   --workers 4
   --limit-concurrency 32
   --timeout-graceful-shutdown 15
   ```

   `TimeoutStopSec=30` is also set at the systemd layer.

3. Apache MCP proxy simplified.

   `apache-swf-monitor.conf` now treats MCP as bounded request/response
   traffic:

   - proxy timeout reduced from 3600s to 60s
   - streaming-specific `proxy-sendchunked` and `no-gzip` settings removed

4. Bots moved to local MCP by default.

   PanDA bot and testbed bot default to:

   ```text
   http://127.0.0.1:8001/swf-monitor/mcp/
   ```

5. Bot exception paths fixed.

   PanDA bot initializes tool metadata before MCP calls can fail, preventing
   the observed secondary `UnboundLocalError`. PanDA and testbed bots now log
   unexpected response-task exceptions and return a visible Mattermost error
   instead of failing silently.

6. Async DB logging fixed.

   `DbLogHandler` no longer writes Django ORM records directly from the caller
   context. It now queues log payloads and writes them from a background thread,
   avoiding Django async-context ORM errors.

7. MCP health endpoint added.

   New endpoint:

   ```text
   /swf-monitor/api/mcp-health/
   ```

   It verifies that Django can serve a request and reach the default database
   without invoking MCP transport/session code.

8. MCP watchdog added.

   New files:

   - `scripts/mcp_watchdog.py`
   - `swf-monitor-mcp-watchdog.service`
   - `swf-monitor-mcp-watchdog.timer`

   The watchdog checks:

   - MCP health endpoint
   - MCP `initialize`
   - MCP `tools/list`

   When run by the timer with `--restart`, it restarts
   `swf-monitor-mcp-asgi.service` after a failed probe.

9. Documentation updated.

   `docs/MCP.md` and `docs/PRODUCTION_DEPLOYMENT.md` now describe the local
   stateless request/response operating model and the watchdog.

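The queue-and-thread pattern described in item 6 can be illustrated as follows. This is a sketch, not the actual `DbLogHandler`: the class name `QueuedDbLogHandler`, the `write_record` callable (standing in for the Django ORM write), and the drop-when-full policy are all assumptions.

```python
import logging
import queue
import threading


class QueuedDbLogHandler(logging.Handler):
    """Hypothetical stand-in for DbLogHandler: enqueue in emit(), write from a thread."""

    def __init__(self, write_record, maxsize=1000):
        super().__init__()
        # write_record is assumed to wrap the real database write (e.g. an ORM create).
        self._write_record = write_record
        self._queue = queue.Queue(maxsize=maxsize)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def emit(self, record):
        # Runs in the caller's context (possibly async), so it must not touch the ORM.
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        try:
            self._queue.put_nowait(payload)
        except queue.Full:
            pass  # assumption: drop the record rather than block the caller

    def _drain(self):
        # Runs in the background thread, where synchronous database writes are safe.
        while True:
            payload = self._queue.get()
            try:
                if payload is None:  # sentinel pushed by close()
                    return
                self._write_record(payload)
            except Exception:
                pass  # a failed write must not kill the worker thread
            finally:
                self._queue.task_done()

    def close(self):
        self._queue.put(None)
        self._worker.join(timeout=5)
        super().close()
```

The standard library's `logging.handlers.QueueHandler` and `QueueListener` provide the same split between enqueueing and writing, and may be a simpler choice where a custom handler is not required.
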
## Deployment Performed

Deployment command used:

```bash
sudo /opt/swf-monitor/bin/deploy-swf-monitor.sh branch infra/baseline-v35
```

Deployment result:

- release: `branch-infra-baseline-v35`
- deployed commit: `ed87bd8`
- Apache health check: passed
- Apache configuration synced from the canonical copy in the repository
- ASGI worker restarted by deploy script
- PanDA bot restarted by deploy script
- testbed bot restarted by deploy script

An additional manual sync of the systemd units was required because the deploy
script does not install service unit definitions:

```bash
sudo install -o root -g root -m 644 /opt/swf-monitor/current/swf-monitor-mcp-asgi.service /etc/systemd/system/swf-monitor-mcp-asgi.service
sudo install -o root -g root -m 644 /opt/swf-monitor/current/swf-monitor-mcp-watchdog.service /etc/systemd/system/swf-monitor-mcp-watchdog.service
sudo install -o root -g root -m 644 /opt/swf-monitor/current/swf-monitor-mcp-watchdog.timer /etc/systemd/system/swf-monitor-mcp-watchdog.timer
sudo systemctl daemon-reload
sudo systemctl restart swf-monitor-mcp-asgi.service
sudo systemctl enable --now swf-monitor-mcp-watchdog.timer
```

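The manual sync above installs all three units unconditionally. As a rough sketch of how a script might limit that to units that actually changed, the helpers below compare each release copy against the installed copy. The function names `changed_units` and `install_commands` are hypothetical, and the code only builds the commands rather than running them.

```python
import filecmp
from pathlib import Path

# Unit file names as listed in this document.
UNIT_NAMES = [
    "swf-monitor-mcp-asgi.service",
    "swf-monitor-mcp-watchdog.service",
    "swf-monitor-mcp-watchdog.timer",
]


def changed_units(release_dir, systemd_dir, names=UNIT_NAMES):
    """Return unit names whose release copy is absent from, or differs from, systemd_dir."""
    changed = []
    for name in names:
        source = Path(release_dir) / name
        target = Path(systemd_dir) / name
        if not source.is_file():
            continue  # this release ships no such unit, so there is nothing to install
        # shallow=False forces a byte-for-byte comparison instead of trusting stat data.
        if not target.is_file() or not filecmp.cmp(source, target, shallow=False):
            changed.append(name)
    return changed


def install_commands(release_dir, systemd_dir, names):
    """Build argv lists mirroring the manual commands; the caller decides whether to run them."""
    cmds = [
        ["install", "-o", "root", "-g", "root", "-m", "644",
         str(Path(release_dir) / name), str(Path(systemd_dir) / name)]
        for name in names
    ]
    if cmds:
        cmds.append(["systemctl", "daemon-reload"])
    return cmds
```

With the paths used above, `changed_units("/opt/swf-monitor/current", "/etc/systemd/system")` would return the unit names that need installing. Running `systemd-analyze verify` on each changed file before `daemon-reload` is one way to validate the units first.
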
## Live Verification

Verified after deployment:

- `swf-monitor-mcp-asgi.service`: active
- `swf-panda-bot.service`: active
- `swf-testbed-bot.service`: active
- `swf-monitor-mcp-watchdog.timer`: active
- MCP health endpoint returned:

  ```json
  {
    "ok": true,
    "service": "swf-monitor-mcp-asgi",
    "database": "ok",
    "mcp_stateless": true
  }
  ```

- Watchdog direct probe returned:

  ```text
  MCP watchdog OK: 45 tools
  ```

- PanDA bot journal showed:

  ```text
  Listening on #pandabot ... MCP: http://127.0.0.1:8001/swf-monitor/mcp/
  HTTP MCP: 13 tools
  ```

- Testbed bot journal showed:

  ```text
  Listening on #swf-testbed-bot + DMs (MCP: http://127.0.0.1:8001/swf-monitor/mcp/)
  Discovered 44 tools via MCP
  ```

User confirmed the bot MCP path is working after deployment.

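The probe sequence behind these results (health endpoint, `initialize`, `tools/list`, optional restart) can be sketched as below. This is not the contents of `scripts/mcp_watchdog.py`: the function names, the injected `probes` list, and the 0/1 return convention are assumptions made for illustration.

```python
import subprocess

# Unit name as given in this document.
ASGI_UNIT = "swf-monitor-mcp-asgi.service"


def health_ok(payload):
    """True when a health payload of the shape shown above reports a healthy service."""
    return payload.get("ok") is True and payload.get("database") == "ok"


def run_probes(probes):
    """Run (name, callable) probes in order; return the first failing name, or None."""
    for name, probe in probes:
        try:
            if not probe():
                return name
        except Exception:
            # Timeouts and connection errors count as a failed probe.
            return name
    return None


def watchdog(probes, restart=False, run=subprocess.run):
    """Return 0 if every probe passes, else 1, restarting the ASGI unit first if asked."""
    failed = run_probes(probes)
    if failed is None:
        return 0
    if restart:
        # check=False: a failing restart should not raise inside the watchdog itself.
        run(["systemctl", "restart", ASGI_UNIT], check=False)
    return 1
```

Passing `run` as a parameter keeps the restart side effect injectable, so the decision logic can be exercised without touching systemd.
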
## Notes And Caveats

The current `django-mcp-server` adapter still creates and shuts down a
`StreamableHTTPSessionManager` per request internally. Stateless mode removes
the dependence on server-side MCP sessions and keeps requests short, but it is
not a full replacement for a correct lifespan-managed ASGI MCP implementation.

One watchdog run failed during the exact ASGI restart window and restarted the
ASGI service. A subsequent manual watchdog run succeeded. This is expected for
the initial enable/restart sequence.

The PanDA bot performs model/cache initialization on startup and may briefly
show high CPU usage after a restart. This is separate from the MCP ASGI worker
lockup problem.

## Remaining Follow-Up

1. Monitor journals and process CPU after sustained use.

   Watch:

   ```bash
   sudo journalctl -u swf-monitor-mcp-asgi.service -f
   sudo journalctl -u swf-monitor-mcp-watchdog.service -f
   sudo journalctl -u swf-panda-bot.service -f
   ```

2. Consider updating the deploy script to install changed systemd unit files
   automatically, with validation before reload.

3. Consider optimizing deployment venv handling. The current deploy script
   copies the virtual environment on every deploy because releases are
   self-contained; in principle the venv only needs to change when dependency
   inputs change.

4. If remote MCP or GET/SSE streaming becomes a real requirement, implement it
   as a dedicated ASGI app with an application-lifetime
   `StreamableHTTPSessionManager` and load-test it before advertising support.

5. Add request timing metrics around MCP calls if the endpoint still shows
   unexplained stalls under normal local bot usage.
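
For the request timing item, one lightweight option is a context manager that records elapsed time per MCP call. The sketch below assumes a hypothetical `timed_mcp_call` helper, a logger named `mcp.timing`, and a 5-second slow-call threshold; none of these come from the existing bots.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("mcp.timing")  # logger name is an assumption


@contextmanager
def timed_mcp_call(method, slow_threshold_s=5.0, sink=None):
    """Time one MCP call, log the result, and warn when it exceeds slow_threshold_s."""
    start = time.perf_counter()
    outcome = "ok"
    try:
        yield
    except Exception:
        outcome = "error"
        raise  # timing must never swallow the caller's exception
    finally:
        elapsed = time.perf_counter() - start
        level = logging.WARNING if elapsed >= slow_threshold_s else logging.INFO
        logger.log(level, "mcp method=%s outcome=%s elapsed=%.3fs", method, outcome, elapsed)
        if sink is not None:
            # Optional in-memory collection, e.g. for tests or later aggregation.
            sink.append((method, outcome, elapsed))
```

A bot would wrap each request, for example `with timed_mcp_call("tools/call"): ...`, so that slow calls surface as warnings in the journal alongside the existing service logs.
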