# MCP Stabilization Plan

Date: 2026-04-27

## Context

The swf-monitor MCP endpoint has repeatedly locked up on swf-testbed. The
current production service runs uvicorn with 20 worker processes for
`/swf-monitor/mcp/`, fronted by Apache. Increasing the worker count did not fix
the failure mode: on 2026-04-27 the service became nonresponsive, Apache
returned MCP backend 502s, the PanDA bot timed out waiting for MCP
`initialize`, and systemd had to SIGKILL all uvicorn workers after graceful
shutdown failed.

This is not primarily a concurrency shortage. It is an MCP transport and
lifecycle problem.

## Appraisal

The installed `django-mcp-server` adapter creates, runs, and shuts down a fresh
`StreamableHTTPSessionManager` per Django request via `async_to_sync`. The MCP
SDK documents `StreamableHTTPSessionManager` as an application-lifetime object
that should be created once and run once. The current adapter may work for short
JSON POST requests, but it is a poor fit for long-lived streamable HTTP/SSE
sessions.

The deployment also claims streamable HTTP support, but GET/SSE behavior is not
healthy in practice. Direct GET requests to `/swf-monitor/mcp/` return 400 or
406 depending on headers, and logs show clients attempting GET and receiving
406. The system is paying the operational complexity cost of streamable HTTP
without delivering robust streaming semantics.

On swf-testbed, swf-monitor MCP is used locally. There is no current operational
requirement for remote MCP clients to maintain long-lived streaming sessions.
The useful surface is request/response tool invocation: `initialize`,
`tools/list`, and `tools/call`.
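For concreteness, the three request bodies in that surface could be built as shown below. This is a minimal sketch, not code from swf-monitor: the helper names, the client name, and `swf_example_tool` are placeholders, and the protocol version string is an assumption that should match whatever revision the deployed server negotiates.

```python
import json
from itertools import count

# Headers a POST-only client would typically send for MCP over HTTP
# (assumption: the server expects both media types in Accept).
HEADERS = {
    "Content-Type": "application/json",
    "Accept": "application/json, text/event-stream",
}

_request_ids = count(1)  # simple per-process JSON-RPC id generator


def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request body as a dict."""
    body = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        body["params"] = params
    return body


def initialize_request(client_name="example-client", protocol_version="2025-03-26"):
    # client_name and protocol_version are illustrative defaults.
    return jsonrpc_request("initialize", {
        "protocolVersion": protocol_version,
        "capabilities": {},
        "clientInfo": {"name": client_name, "version": "0.0.0"},
    })


def tools_list_request():
    return jsonrpc_request("tools/list")


def tools_call_request(tool_name, arguments=None):
    return jsonrpc_request("tools/call", {
        "name": tool_name,
        "arguments": arguments or {},
    })


if __name__ == "__main__":
    # "swf_example_tool" is a hypothetical tool name.
    print(json.dumps(tools_call_request("swf_example_tool", {"limit": 5}), indent=2))
```

Each body is sent as a single POST and answered with a single response, so nothing in this exchange requires a GET/SSE stream.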

## Streaming Clarification

The proposed stabilization does not replace MCP with arbitrary REST and does
not break the MCP JSON-RPC tool surface. The endpoint should remain MCP over
HTTP for clients that use POST request/response calls.

The change is to stop depending on long-lived GET/SSE streaming and server-side
MCP session state until there is a proven use case and a correct ASGI
implementation.

Client implications:

- PanDA bot, testbed bot, and local scripts that use POST-only JSON-RPC should
  continue to work.
- Because swf-monitor MCP is only used locally here, there is no known external
  user depending on GET/SSE streaming.
- A generic MCP client that insists on stateful streamable HTTP sessions or
  working GET/SSE streams could fail, but that behavior is already unreliable
  today.
- If streaming becomes valuable later, reimplement it deliberately with an
  ASGI app that owns a lifespan-managed `StreamableHTTPSessionManager`, rather
  than the current Django APIView bridge.
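To illustrate what "lifespan-managed" means in the last point, the sketch below shows a bare ASGI callable whose lifespan scope holds one manager open for the life of the process, while each HTTP request only calls into it. `StandInSessionManager` is a hypothetical stand-in written for this example and is not the MCP SDK class; it only mimics the general shape of a `run()` context plus a per-request handler, and the real class's interface should be checked against the SDK documentation before any reimplementation.

```python
import asyncio
from contextlib import asynccontextmanager


class StandInSessionManager:
    """Stand-in for an application-lifetime manager. NOT the MCP SDK class."""

    def __init__(self):
        self.run_count = 0
        self.running = False

    @asynccontextmanager
    async def run(self):
        # Entered once at startup and left once at shutdown.
        self.run_count += 1
        self.running = True
        try:
            yield
        finally:
            self.running = False

    async def handle_request(self, scope, receive, send):
        # A real manager would dispatch to the MCP transport here.
        if not self.running:
            raise RuntimeError("manager is not running")
        await send({
            "type": "http.response.start",
            "status": 200,
            "headers": [(b"content-type", b"application/json")],
        })
        await send({"type": "http.response.body", "body": b'{"ok": true}'})


manager = StandInSessionManager()  # created once per worker process


async def app(scope, receive, send):
    """Bare ASGI callable: the lifespan scope owns manager.run()."""
    if scope["type"] == "lifespan":
        await receive()  # "lifespan.startup"
        async with manager.run():
            await send({"type": "lifespan.startup.complete"})
            await receive()  # "lifespan.shutdown"
        await send({"type": "lifespan.shutdown.complete"})
    elif scope["type"] == "http":
        await manager.handle_request(scope, receive, send)
```

The contrast with the current adapter is that `run()` is entered once per process here instead of once per request, so no request has to start or tear down the manager.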

## Repair Plan

1. Make MCP stateless request/response first.

   Set `DJANGO_MCP_GLOBAL_SERVER_CONFIG["stateless"] = True`, stop issuing
   Django-backed MCP session IDs, and reduce uvicorn from 20 workers to a small
   count such as 2 or 4. The current tools are database/API primitives and do
   not require server-side MCP session state.

   Authentication implication: low risk. Current MCP authentication is request
   middleware, not MCP session state. Bearer token validation still happens per
   request. Local unauthenticated POSTs continue to pass unless that policy is
   changed explicitly.

2. Stop advertising streaming as an operational dependency.

   Document the local MCP deployment as POST request/response MCP. Do not
   promise GET/SSE support until it is deliberately reimplemented and tested.

3. Prefer local REST-style HTTP for bots over Python service coupling.

   Short term, point local bot MCP traffic at
   `http://127.0.0.1:8001/swf-monitor/mcp/` to bypass Apache/HTTPS for
   same-host calls.

   Better follow-up: add local REST endpoints for bot memory and common
   SWF/PanDA queries so bots can remain mobile without importing Django service
   functions directly. Keep MCP as an AI integration surface, not mandatory
   loopback plumbing for every local bot operation.

4. Add guardrails and observability.

   Add uvicorn concurrency limits, shorter graceful shutdown, a cheap
   non-MCP health endpoint, request start/end timing logs, and a watchdog that
   restarts `swf-monitor-mcp-asgi.service` if `initialize` or `tools/list`
   exceeds a small threshold.

5. Fix async logging.

   `DbLogHandler` currently writes to `AppLog` through synchronous Django ORM
   calls even when invoked from async contexts, producing repeated
   `You cannot call this from an async context` errors. Replace it with an
   async-safe queue/background-writer design, or disable DB logging in async
   uvicorn/bot processes until the queue writer exists. Logging failures must
   be visible; silent failure is not acceptable.

6. Fix bot exception paths.

   Initialize PanDA bot metadata variables before any MCP call can fail. On MCP
   failure, return a clear Mattermost-visible error and log the exception. Do
   the same review for testbed bot paths. A failed MCP request must not produce
   hidden task exceptions such as `UnboundLocalError`.
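One possible shape for the queue/background-writer design in step 5 uses the standard library's `QueueHandler` and `QueueListener`. This is a sketch under assumptions: `ListSinkHandler` is a hypothetical stand-in for the blocking database writer, `install_queue_logging` is an illustrative helper name, and the real `DbLogHandler` is not shown in this plan.

```python
import logging
import logging.handlers
import queue


class ListSinkHandler(logging.Handler):
    """Hypothetical stand-in for a blocking writer such as a DB-backed handler."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def emit(self, record):
        try:
            self.rows.append(record.getMessage())  # real code would do the ORM write
        except Exception:
            # Keeps failures visible: prints to stderr while
            # logging.raiseExceptions is true.
            self.handleError(record)


def install_queue_logging(logger, blocking_handler):
    """Route a logger through a queue; a listener thread does the blocking work."""
    log_queue = queue.Queue(-1)  # unbounded, so callers never block on put
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    listener = logging.handlers.QueueListener(
        log_queue, blocking_handler, respect_handler_level=True
    )
    listener.start()
    return listener


if __name__ == "__main__":
    log = logging.getLogger("swf.example")  # hypothetical logger name
    log.setLevel(logging.INFO)
    sink = ListSinkHandler()
    listener = install_queue_logging(log, sink)
    log.info("tool call %s finished", "swf_example_tool")
    listener.stop()  # drains the queue before returning
    print(sink.rows)
```

With this arrangement the async caller only enqueues a record, and the synchronous write runs in the listener's own thread, where no event loop is running, so the async-context check should not fire there. The writing handler should catch its own errors and call `handleError`, since an uncaught exception in `emit` can stop the listener thread and lose later records.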

## Verification

Before deployment:

- Direct local MCP POST tests: `initialize`, `tools/list`, representative
  `swf_*`, `panda_*`, and `pcs_*` calls.
- Apache-proxied MCP POST tests.
- Bot smoke tests against the local endpoint.
- Confirm GET/SSE behavior is intentionally unsupported and documented as
  such, pending any later full implementation.
- Confirm `systemctl restart swf-monitor-mcp-asgi.service` exits cleanly
  without SIGKILL.
- Confirm async logging no longer emits Django async-context errors.
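The direct POST tests could be scripted along the following lines. This is a sketch only: the function names, the 5-second timeout, and the 2-second threshold are illustrative assumptions, and whether the server accepts `tools/list` without a prior `initialize` depends on its configuration.

```python
import json
import time
import urllib.request

# Local endpoint taken from the repair plan above.
MCP_URL = "http://127.0.0.1:8001/swf-monitor/mcp/"


def http_post_json(url, payload, timeout):
    """POST a JSON body and return (status, raw_body)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read()


def timed_check(payload, post=http_post_json, url=MCP_URL, timeout=5.0, threshold=2.0):
    """Return (healthy, elapsed_seconds) for one JSON-RPC POST.

    `post` is injectable so the check can be exercised without a live server.
    """
    started = time.monotonic()
    try:
        status, _body = post(url, payload, timeout)
    except OSError:  # covers urllib errors, refused connections, and timeouts
        status = None
    elapsed = time.monotonic() - started
    return (status == 200 and elapsed <= threshold), elapsed


if __name__ == "__main__":
    healthy, elapsed = timed_check({"jsonrpc": "2.0", "id": 1, "method": "tools/list"})
    print(f"tools/list healthy={healthy} elapsed={elapsed:.3f}s")
```

The same function could back a watchdog: a timer that runs `timed_check` and restarts `swf-monitor-mcp-asgi.service` after repeated unhealthy results.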

## Execution Order

Phase 1 should be a stabilization patch: stateless MCP, reduced worker count,
local bot endpoint, async logging fix, and bot exception-path fixes.

Phase 2 should add broader observability: health endpoint, watchdog, timing
logs, and documentation cleanup.
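As an illustration of the bot exception-path fix listed under Phase 1, the pattern below binds every name before the MCP call can fail and turns a failure into a visible message. The function, the logger name, and the `call_mcp` callable are hypothetical; the actual bot code is not shown in this plan.

```python
import logging

logger = logging.getLogger("bot.example")  # hypothetical logger name


def handle_query(question, call_mcp):
    """Answer a bot query; `call_mcp` is a hypothetical callable that raises on failure."""
    # Bind every name used after the try block before the call can fail,
    # so a failure cannot surface later as UnboundLocalError.
    tool_name = None
    result = None
    error = None
    try:
        tool_name, result = call_mcp(question)
    except Exception as exc:
        logger.exception("MCP request failed")  # full traceback goes to the log
        error = f"MCP request failed: {exc}"
    if error is not None:
        return f":warning: {error}"  # text the bot would post to the channel
    return f"{tool_name}: {result}"
```

A caller would post the returned string to the channel, so a user sees the failure instead of a reply that never arrives.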