ADR-007: AEGIS AWS Deployment Architecture

Status: Accepted
Date: 2026-02-10
Decision Makers: joshuakirby
Supersedes: None (first deployment architecture decision)

Context

AEGIS Governance is a production-complete SDK (1859 tests, 94.55% coverage at the time of this ADR) with CDK infrastructure defined but not yet deployed. All code — CLI, MCP server, Prometheus exporter, Grafana dashboards, Docker Compose — exists and works locally. Five ROADMAP items (17-20a) are blocked on "infrastructure" because no AWS resources have been provisioned.

Key Constraints

  1. MCP server cannot run on Lambda — MCP servers use stdio or long-lived HTTP connections incompatible with Lambda's request-response model (confirmed via AWS re:Post and Cloudflare MCP docs)
  2. Gate evaluation is request-response: pcw_decide() takes a context, evaluates gates, and returns a decision in <1 second
  3. Expected traffic: ~1,770 requests/month (Lambda cost breakeven vs ECS is ~145K requests/month)
  4. AEGIS must be a standalone product — not a Libertas-Core subsystem
  5. Libertas-Core has existing VPC/KMS/AMP — shared infrastructure is available
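
As a quick check on constraint 3, the gap between expected traffic and the stated Lambda/ECS breakeven can be computed directly. This is only arithmetic on the two figures quoted above; the variable names are illustrative.

```python
# Figures taken from constraint 3 above.
expected_requests_per_month = 1_770
breakeven_requests_per_month = 145_000

# How many times traffic could grow before ECS becomes the cheaper option.
headroom = breakeven_requests_per_month / expected_requests_per_month
print(f"Traffic headroom before breakeven: {headroom:.1f}x")  # 81.9x
```

In other words, traffic would need to grow by roughly two orders of magnitude before the breakeven point is reached.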

Decision

Deploy AEGIS as a hybrid Lambda + ECS architecture:

  • Tier 1 (Lambda): Gate evaluation (pcw_decide()) behind API Gateway with IAM auth
  • Tier 2 (ECS Fargate): MCP server as a long-lived container with HTTP transport, internal ALB, and ADOT sidecar for Prometheus → AMP
  • Tier 3 (pip): SDK distribution via PyPI (pip install aegis-governance)
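
A minimal sketch of what the Tier 1 handler could look like, assuming API Gateway's Lambda proxy integration (request body delivered as a JSON string, response returned as statusCode/headers/body). The pcw_decide() stub below is a stand-in: its signature, the risk_score field, and the approve/escalate outcomes are assumptions for illustration, not the SDK's actual API.

```python
import json
from typing import Any


def pcw_decide(context: dict[str, Any]) -> dict[str, Any]:
    """Stand-in for the SDK's pcw_decide(); the real signature may differ."""
    # Hypothetical rule, purely for illustration: approve when risk is low.
    risk = float(context.get("risk_score", 0.0))
    return {"decision": "approve" if risk < 0.5 else "escalate", "risk_score": risk}


def handler(event: dict[str, Any], _lambda_context: Any = None) -> dict[str, Any]:
    """Entry point for POST /evaluate behind API Gateway (proxy integration)."""
    try:
        context = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON"})}
    if not isinstance(context, dict):
        return {"statusCode": 400, "body": json.dumps({"error": "expected a JSON object"})}

    # Stateless request-response: one context in, one decision out.
    decision = pcw_decide(context)
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(decision),
    }
```

Because the handler holds no state between invocations, it fits Lambda's request-response model; IAM authorization would be enforced by API Gateway before the function is invoked.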

Architecture Diagram

API Gateway (REST, IAM auth)
  ├── POST /evaluate   → Lambda (aegis-evaluate)
  ├── POST /risk-check → Lambda
  └── GET  /health     → Lambda

Internal ALB (:80) → ECS Fargate
                       ├── aegis-mcp-server (HTTP :8080, metrics :9090)
                       │     ├── POST /mcp     (JSON-RPC 2.0, single + batch)
                       │     ├── GET  /health  (200 OK)
                       │     └── GET  /mcp     (405 — SSE not implemented)
                       └── adot-collector (sidecar → AMP remote write) [optional]

DynamoDB (workflow state)
Secrets Manager (BIP-322 keys)
S3 (audit logs)
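
The MCP server routes in the diagram can be sketched as a plain routing function. This is an illustration only, not the server's actual code: the ping method is hypothetical, returning HTTP 400 on a parse error is an assumption, and JSON-RPC notifications (requests without an id) are not treated specially.

```python
import json
from typing import Any


def handle_rpc(request: Any) -> dict[str, Any]:
    """Answer one JSON-RPC 2.0 request; only a hypothetical 'ping' method exists."""
    if not isinstance(request, dict):
        return {"jsonrpc": "2.0", "id": None,
                "error": {"code": -32600, "message": "Invalid Request"}}
    req_id = request.get("id")
    if request.get("jsonrpc") != "2.0" or "method" not in request:
        return {"jsonrpc": "2.0", "id": req_id,
                "error": {"code": -32600, "message": "Invalid Request"}}
    if request["method"] == "ping":
        return {"jsonrpc": "2.0", "id": req_id, "result": "pong"}
    return {"jsonrpc": "2.0", "id": req_id,
            "error": {"code": -32601, "message": "Method not found"}}


def route(method: str, path: str, body: str = "") -> tuple[int, str]:
    """Map (HTTP method, path) to (status, response body) as in the diagram."""
    if (method, path) == ("GET", "/health"):
        return 200, "OK"
    if (method, path) == ("GET", "/mcp"):
        return 405, "SSE not implemented"
    if (method, path) == ("POST", "/mcp"):
        try:
            payload = json.loads(body)
        except json.JSONDecodeError:
            error = {"jsonrpc": "2.0", "id": None,
                     "error": {"code": -32700, "message": "Parse error"}}
            return 400, json.dumps(error)
        if isinstance(payload, list) and payload:
            # Batch: answer each request and return the responses as an array.
            return 200, json.dumps([handle_rpc(item) for item in payload])
        return 200, json.dumps(handle_rpc(payload))
    return 404, "Not Found"
```

For example, route("POST", "/mcp", '{"jsonrpc": "2.0", "id": 1, "method": "ping"}') returns status 200 with a single response object, while a JSON array of requests yields an array of responses.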

Alternatives Considered

1. Lambda-Only

  • Pro: Simplest, cheapest for low traffic
  • Con: Cannot run MCP server (stdio/streaming incompatible with Lambda)
  • Rejected: MCP is a core deployment target

2. ECS-Only

  • Pro: Single compute model, simpler architecture
  • Con: $30-46/month for ECS vs $11/month Lambda at 2K requests; idle compute waste
  • Rejected: Lambda is 85x cheaper for gate evaluation at projected volume

3. EKS (Kubernetes)

  • Pro: Industry standard for containerized workloads
  • Con: Massive overhead for single-service deployment; control plane ~$73/month
  • Rejected: Overkill for current scale

4. App Runner

  • Pro: Zero infrastructure management
  • Con: No VPC integration (needed for DynamoDB VPC endpoints); limited observability
  • Rejected: Missing required features

Consequences

Positive

  • Cost-efficient: ~$51/month total (Lambda + ECS + storage)
  • Separation of concerns: Gate eval (stateless, fast) vs MCP (stateful, long-lived)
  • Standalone product: Any repo can use the governance gate action
  • Shared infrastructure: Leverages Libertas VPC/KMS/AMP without duplication

Negative

  • Two compute models: Lambda + ECS increases operational surface
  • CDK complexity: Four stacks instead of one
  • Cold starts: Lambda cold start (3-5s with scipy) affects first request

Mitigations

  • CDK stacks are self-contained with clear dependency chain
  • Lambda cold start acceptable for async governance gates (not latency-critical)
  • Provisioned concurrency available if cold start becomes an issue ($11/mo)

Cost Analysis

Component                          Monthly    Annual
Lambda (512MB, ~2K invocations)    $11        $132
ECS Fargate (0.25 vCPU, 24/7)      $30        $360
API Gateway (2K requests)          $0.01      $0.12
DynamoDB (on-demand, <1GB)         $5         $60
Secrets Manager (5 secrets)        $2         $24
CloudWatch Logs                    $2         $24
S3 (audit logs, <10GB)             $0.23      $2.76
AMP (shared, incremental)          $0.30      $3.60
Total                              ~$51/mo    ~$607/yr
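
The totals can be reproduced by summing the line items. The short script below uses only the monthly figures from the table; Decimal is used so the cents add up exactly.

```python
from decimal import Decimal

# Monthly line items copied from the cost table above (USD).
monthly_costs = {
    "Lambda (512MB, ~2K invocations)": Decimal("11"),
    "ECS Fargate (0.25 vCPU, 24/7)": Decimal("30"),
    "API Gateway (2K requests)": Decimal("0.01"),
    "DynamoDB (on-demand, <1GB)": Decimal("5"),
    "Secrets Manager (5 secrets)": Decimal("2"),
    "CloudWatch Logs": Decimal("2"),
    "S3 (audit logs, <10GB)": Decimal("0.23"),
    "AMP (shared, incremental)": Decimal("0.30"),
}

monthly_total = sum(monthly_costs.values(), Decimal("0"))
annual_total = monthly_total * 12

print(f"Monthly total: ${monthly_total}")  # $50.54
print(f"Annual total:  ${annual_total}")   # $606.48
```

Both results agree with the table's rounded totals to within a dollar. ECS Fargate and Lambda together account for $41 of the $50.54 monthly figure.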

Implementation

Infrastructure is defined in CDK (Python) under infra/:

  1. AegisSharedStack — DynamoDB, Secrets Manager, S3, KMS
  2. AegisLambdaStack — Lambda function, API Gateway, IAM
  3. AegisMcpStack — ECS cluster, Fargate service, ADOT sidecar
  4. AegisMonitoringStack — CloudWatch alarms, dashboard, SNS topic
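
One way to reason about deployment order is a topological sort over the stack dependencies. The dependency edges below are an assumption made for illustration (shared resources first, monitoring last); the ADR does not spell out the actual chain, so check it against the CDK code under infra/.

```python
from graphlib import TopologicalSorter

# Assumed mapping of stack -> stacks it depends on (not confirmed by this ADR).
stack_dependencies = {
    "AegisSharedStack": set(),
    "AegisLambdaStack": {"AegisSharedStack"},
    "AegisMcpStack": {"AegisSharedStack"},
    "AegisMonitoringStack": {"AegisLambdaStack", "AegisMcpStack"},
}

# static_order() yields each stack only after all of its dependencies.
deploy_order = list(TopologicalSorter(stack_dependencies).static_order())
print(deploy_order)
```

Under these assumed edges, AegisSharedStack always deploys first and AegisMonitoringStack last; the two compute stacks can deploy in either order.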

References