
AEGIS Production Deployment Guide

Version: 1.0.0 | Updated: 2026-02-09 | Status: Active

This guide covers deploying AEGIS in production environments, including Docker, Kubernetes, and AWS.


1. Prerequisites

  • Python: 3.9+ (3.11 recommended for production)
  • pip: 21.0+
  • OS: Linux (Ubuntu 22.04+, Amazon Linux 2023) or macOS
  • Optional: Docker 24+, Kubernetes 1.28+, PostgreSQL 15+

Deployment Profiles

| Profile | Install Command | Use Case |
|---|---|---|
| Minimal | pip install aegis-governance | Evaluation only, zero dependencies |
| Standard | pip install aegis-governance[engine,telemetry] | Production with metrics + scipy z-scores |
| Full | pip install aegis-governance[all] | All features including crypto + persistence |
| PQ-Hardened | pip install aegis-governance[crypto,pqc,persistence] | Post-quantum signatures + durable state |

2. Installation Profiles

Minimal (Zero Dependencies)

pip install aegis-governance

Provides: pcw_decide(), the CLI (aegis evaluate), gate evaluation, and the Bayesian posterior. Excludes scipy (z-scores unavailable), Prometheus metrics, and YAML config loading.

Standard

pip install aegis-governance[engine,telemetry,config]

Adds:

  • engine: scipy for utility z-score computation
  • telemetry: prometheus_client for the Prometheus metrics exporter
  • config: pyyaml for YAML configuration loading

Full

pip install aegis-governance[all]

Adds all optional groups: engine, telemetry, config, mcp, crypto, pqc, persistence.

Optional Dependency Groups

| Group | Package(s) | Purpose |
|---|---|---|
| engine | scipy | Utility z-score computation |
| telemetry | prometheus_client | Prometheus metrics exporter |
| config | pyyaml | YAML configuration loading |
| mcp | pyyaml | MCP server for AI agents |
| crypto | btclib, coincurve | BIP-340 Schnorr signatures |
| pqc | liboqs-python | ML-DSA-44, ML-KEM-768 (requires native liboqs) |
| persistence | sqlalchemy, asyncpg, aiosqlite | Durable workflow state |
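To check at startup which optional groups are actually installed, one option is feature detection with importlib. This is a generic pattern, not an API provided by aegis-governance; the module names below are assumptions taken from the table above (pyyaml installs as yaml).

```python
from importlib.util import find_spec

# Map each optional group to the importable modules it provides.
# Group names follow the dependency table; module names are assumptions
# based on each package's conventional import name.
OPTIONAL_GROUPS = {
    "engine": ["scipy"],
    "telemetry": ["prometheus_client"],
    "config": ["yaml"],  # pyyaml installs as "yaml"
    "persistence": ["sqlalchemy", "asyncpg", "aiosqlite"],
}

def available_groups() -> set:
    """Return the optional groups whose modules are all importable."""
    return {
        group
        for group, modules in OPTIONAL_GROUPS.items()
        if all(find_spec(m) is not None for m in modules)
    }
```

A service can log the result once at boot to make "why are z-scores missing?" questions answerable from the logs.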

3. Configuration

Default Configuration

from aegis_governance import AegisConfig

# Uses frozen defaults matching schema/interface-contract.yaml
config = AegisConfig.default()

YAML Configuration

# Requires: pip install aegis-governance[config]
config = AegisConfig.from_yaml("config.yaml")

Example config.yaml:

parameters:
  epsilon_R: 0.01
  epsilon_P: 0.01
  risk_trigger_factor: 2.0
  profit_trigger_factor: 2.0
  trigger_confidence_prob: 0.95
  novelty_gate:
    N0: 0.7
    k: 10.0
    output_threshold: 0.8
  complexity_floor: 0.5
  quality_min_score: 0.7

Dict Configuration

config = AegisConfig.from_dict({"epsilon_R": 0.02, "quality_min_score": 0.8})

Environment Variables

| Variable | Purpose | Default |
|---|---|---|
| AEGIS_CONFIG_PATH | Path to YAML config file | None (uses defaults) |
| AEGIS_METRICS_PORT | Metrics server port | 9090 |
| AEGIS_LOG_LEVEL | Logging level | INFO |
| DATABASE_URL | PostgreSQL connection string | None (in-memory) |

Frozen Parameter Policy

schema/interface-contract.yaml is the authoritative source for parameter values. AegisConfig defaults match this file exactly. Runtime mutation is impossible (frozen dataclass). Parameter changes require formal recalibration approval.
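The frozen-dataclass guarantee can be illustrated with a minimal stand-in (AegisConfig itself ships with the package; the field names and defaults here merely mirror the example config above):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FrozenParams:
    """Minimal stand-in illustrating the frozen-parameter policy."""
    epsilon_R: float = 0.01
    epsilon_P: float = 0.01
    quality_min_score: float = 0.7

params = FrozenParams()
try:
    params.epsilon_R = 0.05  # runtime mutation is rejected
except AttributeError as exc:
    # dataclasses raises FrozenInstanceError, an AttributeError subclass
    print(f"mutation blocked: {exc}")
```

Any parameter change therefore requires constructing a new config object, which is what forces changes through the recalibration approval path rather than ad-hoc runtime edits.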


4. Docker Deployment

Dockerfile

The repository includes a multi-stage Dockerfile:

docker build -t aegis-governance .

Key features:

  • Multi-stage build (smaller image)
  • Non-root user (aegis, UID 1000)
  • Health check via aegis version
  • Exposes port 9090 for metrics
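Those properties can be sketched as a multi-stage Dockerfile. Treat this as an illustrative outline under stated assumptions (the base image, stage names, and install extras are guesses), not the repository's actual file:

```dockerfile
# Build stage: install the package into an isolated prefix
FROM python:3.11-slim AS build
COPY . /src
RUN pip install --prefix=/install "/src[engine,telemetry,config]"

# Runtime stage: copy only the installed packages, drop root
FROM python:3.11-slim
COPY --from=build /install /usr/local
RUN useradd --uid 1000 --create-home aegis
USER aegis
EXPOSE 9090
HEALTHCHECK --interval=30s CMD aegis version || exit 1
```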

Docker Compose

Start AEGIS with Prometheus and Grafana:

docker compose up -d

Services:

| Service | Port | Purpose |
|---|---|---|
| aegis | 9090 | AEGIS metrics server |
| prometheus | 9091 | Prometheus monitoring |
| grafana | 3000 | Grafana dashboards |

Environment Variables (Docker)

| Variable | Default | Purpose |
|---|---|---|
| AEGIS_CONFIG_PATH | /app/schema/interface-contract.yaml | Config file path |
| AEGIS_METRICS_PORT | 9090 | Metrics endpoint port |
| DATABASE_URL | None | PostgreSQL connection string |

Volume Mounts

| Container Path | Purpose |
|---|---|
| /app/schema/ | Configuration schemas (read-only) |
| /app/config/ | Custom configuration (optional) |
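The service and mount tables above can be approximated as a docker-compose.yml. This is a sketch only: the service commands, image names, and host-side volume paths are assumptions, not the repository's actual compose file.

```yaml
# Illustrative compose file matching the service table above.
# Host paths and images are assumptions; adjust to the repository layout.
services:
  aegis:
    build: .
    ports: ["9090:9090"]
    environment:
      AEGIS_METRICS_PORT: "9090"
    volumes:
      - ./schema:/app/schema:ro
  prometheus:
    image: prom/prometheus
    ports: ["9091:9090"]
    volumes:
      - ./monitoring/prometheus:/etc/prometheus:ro
  grafana:
    image: grafana/grafana
    ports: ["3000:3000"]
```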

5. Kubernetes Deployment

Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: aegis-governance
  labels:
    app: aegis-governance
spec:
  replicas: 2
  selector:
    matchLabels:
      app: aegis-governance
  template:
    metadata:
      labels:
        app: aegis-governance
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
      containers:
        - name: aegis
          image: aegis-governance:latest
          command: ["python", "-c", "import signal; from telemetry.metrics_server import MetricsServer; s=MetricsServer(host='0.0.0.0',port=9090); s.start(); signal.sigwait({signal.SIGTERM,signal.SIGINT}); s.stop()"]
          ports:
            - containerPort: 9090
              name: metrics
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "256Mi"
          livenessProbe:
            exec:
              command: ["aegis", "version"]
            initialDelaySeconds: 5
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /metrics
              port: 9090
            initialDelaySeconds: 5
            periodSeconds: 10
          env:
            - name: AEGIS_CONFIG_PATH
              value: "/app/config/config.yaml"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: aegis-secrets
                  key: database-url
                  optional: true
          volumeMounts:
            - name: config
              mountPath: /app/config
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: aegis-config

ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: aegis-config
data:
  config.yaml: |
    parameters:
      epsilon_R: 0.01
      epsilon_P: 0.01
      risk_trigger_factor: 2.0
      profit_trigger_factor: 2.0
      trigger_confidence_prob: 0.95

Secret

apiVersion: v1
kind: Secret
metadata:
  name: aegis-secrets
type: Opaque
stringData:
  database-url: "postgresql+asyncpg://{USER}:{PASSWORD}@{HOST}:5432/aegis"
  # HSM credentials (if PQ-hardened profile)
  hsm-pin: "{HSM_PIN}"

Service + ServiceMonitor

apiVersion: v1
kind: Service
metadata:
  name: aegis-governance
  labels:
    app: aegis-governance
spec:
  selector:
    app: aegis-governance
  ports:
    - name: metrics
      port: 9090
      targetPort: 9090
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: aegis-governance
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: aegis-governance
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

6. AWS Deployment

Lambda (Single Evaluation)

Use Lambda for on-demand proposal evaluation:

import json
from aegis_governance import AegisConfig, PCWContext, PCWPhase, pcw_decide

config = AegisConfig.default()
evaluator = config.create_gate_evaluator()

def handler(event, context):
    ctx = PCWContext(
        agent_id=event.get("agent_id", "lambda"),
        session_id=context.aws_request_id,
        phase=PCWPhase.PLAN,
        proposal_summary=event["proposal_summary"],
        estimated_impact=event.get("estimated_impact", "medium"),
        risk_proposed=event.get("risk_score", 0.1),
        complexity_score=event.get("complexity_score", 0.5),
        quality_score=event.get("quality_score", 0.8),
    )
    decision = pcw_decide(ctx, gate_evaluator=evaluator)
    return {
        "statusCode": 200,
        "body": json.dumps({
            "status": decision.status.value,
            "rationale": decision.rationale,
            "decision_id": decision.decision_id,
        }),
    }

Package with: pip install aegis-governance[engine] -t ./package

ECS/Fargate (Metrics Server)

For long-running metrics collection:

{
  "family": "aegis-governance",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512",
  "containerDefinitions": [
    {
      "name": "aegis",
      "image": "aegis-governance:latest",
      "command": ["python", "-c", "import signal; from telemetry.metrics_server import MetricsServer; s=MetricsServer(host='0.0.0.0',port=9090); s.start(); signal.sigwait({signal.SIGTERM,signal.SIGINT}); s.stop()"],
      "portMappings": [
        { "containerPort": 9090, "protocol": "tcp" }
      ],
      "environment": [
        { "name": "AEGIS_METRICS_PORT", "value": "9090" }
      ],
      "healthCheck": {
        "command": ["CMD-SHELL", "aegis version || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3
      }
    }
  ]
}

DynamoDB (Audit Trail)

For serverless audit trail storage, implement a DynamoDB-backed repository following the WorkflowPersistence protocol in src/workflows/persistence/repository.py. The ORM models in src/workflows/persistence/models.py define the schema (WorkflowInstance, WorkflowTransition, WorkflowCheckpoint).
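A DynamoDB-backed repository could take roughly the shape below. This is a sketch under loud assumptions: the class and method names are hypothetical (consult repository.py for the real WorkflowPersistence signatures), and boto3 is imported lazily so the item-marshalling logic stays testable without AWS credentials.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class TransitionRecord:
    """Simplified stand-in for a WorkflowTransition row."""
    workflow_id: str
    seq: int
    from_state: str
    to_state: str

class DynamoAuditRepository:
    """Sketch of a DynamoDB-backed audit store. Names are illustrative,
    not the WorkflowPersistence protocol's real API."""

    def __init__(self, table_name: str):
        self.table_name = table_name

    @staticmethod
    def to_item(rec: TransitionRecord) -> dict:
        # Partition key: workflow id; sort key: zero-padded transition
        # sequence, so a Query returns transitions in order.
        return {
            "pk": f"WF#{rec.workflow_id}",
            "sk": f"TR#{rec.seq:010d}",
            "payload": json.dumps(asdict(rec)),
            "recorded_at": datetime.now(timezone.utc).isoformat(),
        }

    def save(self, rec: TransitionRecord) -> None:
        import boto3  # lazy import: only needed at runtime, not for tests
        table = boto3.resource("dynamodb").Table(self.table_name)
        table.put_item(Item=self.to_item(rec))
```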


7. Observability Setup

Metrics Endpoint

Start the built-in metrics server:

# Programmatic (persistent server in background thread)
from telemetry.metrics_server import MetricsServer
server = MetricsServer(port=9090)
server.start()

To dump current metrics once (useful for debugging):

aegis metrics

Available Metrics

| Metric | Type | Description |
|---|---|---|
| aegis_decisions_total | Counter | Decision outcomes by gate type |
| aegis_gates_evaluated_total | Counter | Gate evaluations performed |
| aegis_latency_seconds | Histogram | Operation latency |
| aegis_decision_latency_seconds | Histogram | End-to-end decision latency |
| aegis_proposals_total | Counter | Proposals by state |
| aegis_active_proposals | Gauge | Currently active proposals |
| aegis_kl_divergence | Gauge | Current KL divergence |
| aegis_override_requests_total | Counter | Override requests by outcome |
| aegis_errors_total | Counter | Errors by component |

Prometheus Setup

  1. Copy recording rules: monitoring/prometheus/recording-rules.yaml
  2. Copy alerting rules: monitoring/prometheus/alerting-rules.yaml
  3. Configure scrape target: http://aegis:9090/metrics

Pre-computed recording rules:

  • aegis:gate_pass_rate_5m — Gate pass rate over 5 minutes
  • aegis:decision_rate_5m — Decision rate by status
  • aegis:p99_latency_5m — p99 latency by operation
  • aegis:override_rate_1h — Override request rate
  • aegis:error_rate_5m — Error rate by component

Alerting Rules

| Alert | Condition | Severity |
|---|---|---|
| AegisHighGateFailRate | Gate pass rate < 50% for 10m | Warning |
| AegisHighLatency | p99 latency > 1s for 5m | Warning |
| AegisOverrideSpike | Override rate > 0.1/min for 15m | Critical |
| AegisDriftCritical | KL divergence status = critical for 5m | Critical |
| AegisOverrideStalePartial | Partial override stuck > 2h | Warning |
| AegisErrorRate | Error rate > 5% for 5m | Warning |

Grafana Dashboards

Import from monitoring/grafana/:

  • overview-dashboard.json — AEGIS system overview (decisions, gates, latency)
  • risk-analysis-dashboard.json — Risk analytics (drift, Bayesian, overrides)

HTTP Telemetry Sink

Stream telemetry events to a remote collector via HTTP POST.

CLI:

echo '{"risk_proposed": 0.3, "profit_proposed": 0.1}' | \
  aegis evaluate --telemetry-url https://yd1xm4ahcg.execute-api.us-west-2.amazonaws.com/dev/v1/events

YAML Configuration (aegis.yaml):

telemetry_url: "https://yd1xm4ahcg.execute-api.us-west-2.amazonaws.com/dev/v1/events"

Programmatic (per-event):

from aegis_governance import HTTPEventSink, TelemetryEmitter, pcw_decide

emitter = TelemetryEmitter(source="my-service")
emitter.add_sink(HTTPEventSink(
    url="https://yd1xm4ahcg.execute-api.us-west-2.amazonaws.com/dev/v1/events",
    headers={"Authorization": "Bearer ${TOKEN}"},
    timeout=10,
))
result = pcw_decide(context, telemetry_emitter=emitter)

Programmatic (batched — recommended for production):

from aegis_governance import http_sink

sink = http_sink(
    "https://yd1xm4ahcg.execute-api.us-west-2.amazonaws.com/dev/v1/events",
    batch_size=100,
    flush_interval_seconds=60,
)
sink.start()  # Background flush thread
emitter.add_sink(sink)
# ... on shutdown:
sink.stop()   # Drains remaining events

Production notes:

  • Use HTTPS with authentication headers (Authorization: Bearer ... or X-API-Key: ...)
  • BatchHTTPSink buffers events and flushes them as JSON arrays (bounded: maxlen=batch_size*10)
  • Failures are logged but never propagated — telemetry must not crash the producer
  • The background flush thread is a daemon and sleeps in 1-second increments for responsive shutdown
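The bounded-buffer and failure-swallowing behavior just described can be sketched with an injectable sender. The real BatchHTTPSink ships with the package; this stand-in exists only to show the design, with the network call abstracted so the logic is testable offline:

```python
import json
import logging
from collections import deque

log = logging.getLogger("telemetry")

class BatchingSink:
    """Sketch of a bounded, failure-swallowing batch sink.
    `send` is an injected callable (e.g. an HTTPS POST)."""

    def __init__(self, send, batch_size: int = 100):
        self._send = send
        self.batch_size = batch_size
        # Bounded buffer: oldest events drop under sustained backpressure.
        self._buffer = deque(maxlen=batch_size * 10)

    def emit(self, event: dict) -> None:
        self._buffer.append(event)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if not self._buffer:
            return
        batch = list(self._buffer)
        self._buffer.clear()
        try:
            self._send(json.dumps(batch))  # flush as one JSON array
        except Exception:
            # Telemetry must never crash the producer: log and move on.
            log.exception("telemetry flush failed; dropped %d events", len(batch))
```

Swapping `send` for a recording list makes the batching contract easy to unit-test without a collector.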

TLS Requirement (CoSAI MCP-T7)

HTTPEventSink and BatchHTTPSink require https:// URLs by default. Governance telemetry contains decision rationale, gate results, and drift metrics — transmitting this over plaintext risks interception and tampering.

| Component | Enforcement | Escape Hatch |
|---|---|---|
| HTTPEventSink | Rejects http:// URLs | allow_insecure=True |
| BatchHTTPSink | Rejects http:// URLs | allow_insecure=True |
| http_sink() factory | Rejects http:// URLs | allow_insecure=True |
| MCP telemetry_url param | Rejects http:// scheme | None (network-facing) |
| CLI --telemetry-url | Warns and disables on http:// | None |
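The enforcement above amounts to a URL-scheme check at construction time. A minimal sketch of that rule (the real implementation lives in the telemetry sinks, not in this helper):

```python
import logging
from urllib.parse import urlparse

log = logging.getLogger("telemetry")

def validate_telemetry_url(url: str, allow_insecure: bool = False) -> str:
    """Reject plaintext telemetry endpoints unless explicitly allowed."""
    scheme = urlparse(url).scheme
    if scheme == "https":
        return url
    if scheme == "http" and allow_insecure:
        # Leave an audit trail for the explicit opt-out.
        log.warning("insecure telemetry transport enabled for %s", url)
        return url
    raise ValueError(f"telemetry URL must use https:// (got {scheme}://)")
```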

Local development (when you need http://):

from telemetry.emitter import HTTPEventSink

# Explicitly opt into insecure transport for local development
sink = HTTPEventSink(
    url="http://localhost:9090/events",
    allow_insecure=True,  # Logs a WARNING for audit trail
)

MCP server: The MCP server (aegis-mcp-server) does not accept http:// telemetry URLs regardless of binding. This is defense-in-depth — a network-accessible endpoint should never relay telemetry over plaintext.

Reference: CoSAI MCP Security Taxonomy, MCP-T7 (Session and Transport Security Failures) recommends TLS 1.2+ for all MCP transport channels.


8. HSM Integration

Status: Placeholder — requires crypto provider extension (future work)

Architecture

The AEGIS two-key override mechanism requires Ed25519 + ML-DSA-44 hybrid signatures. In production environments, private keys should be stored in a Hardware Security Module (HSM).

Supported HSMs (Planned)

| HSM | Interface | Notes |
|---|---|---|
| AWS CloudHSM | PKCS#11 | Production recommended |
| YubiHSM 2 | PKCS#11 | On-premises |
| SoftHSM2 | PKCS#11 | Development/testing only |

Key Custody

  • Ed25519 signing keys: Generated and stored IN the HSM
  • ML-DSA-44 keys: Generated and stored IN the HSM (when PQ-hardened)
  • Key Encryption Keys (KEK): See src/crypto/kek_provider.py and scripts/generate_master_kek.py
  • Two-key requirement: Both override signers must have independent HSM access

Thread Safety

HSM sessions should use a connection pool pattern (one session per thread). The AEGIS crypto providers are already thread-safe via threading.Lock.
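The one-session-per-thread pattern can be sketched with threading.local. The session factory here is a hypothetical stand-in for a PKCS#11 open-and-login routine, not an AEGIS API:

```python
import threading

class HsmSessionPool:
    """One HSM session per thread via thread-local storage.
    `open_session` is a caller-supplied factory (e.g. a PKCS#11 login);
    it is invoked at most once per thread."""

    def __init__(self, open_session):
        self._open = open_session
        self._local = threading.local()

    def session(self):
        if not hasattr(self._local, "session"):
            self._local.session = self._open()
        return self._local.session
```

Each worker thread then calls pool.session() freely; sessions are never shared across threads, which is what most PKCS#11 providers require.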


9. Multi-Region DR

See Disaster Recovery Assessment for full details.

Strategy

Active-passive with audit trail replication:

| Component | Primary | DR |
|---|---|---|
| AEGIS Process | Active | Standby |
| PostgreSQL | Primary | Streaming replica |
| Audit Trail | Write-ahead | Replicated |

RPO / RTO Targets

| Scenario | RPO | RTO |
|---|---|---|
| Auto-checkpoint on transition | ~0 seconds | < 60 seconds |
| Manual checkpoint | <= 5 minutes | < 60 seconds |
| No persistence configured | N/A (in-memory) | Restart only |

Recovery Procedure

# 1. Verify system health
aegis health

# 2. Resume pending workflows (Python, inside an async context)
from workflows.persistence.durable import DurableWorkflowEngine

engine = DurableWorkflowEngine(session_factory)
await engine.resume_all_pending()  # coroutine: must run inside an event loop

Persistence Layer

  • Production: PostgreSQL with streaming replication
  • Development: SQLite (in-memory or file-based)
  • Schema: src/workflows/persistence/models.py (WorkflowInstance, WorkflowTransition, WorkflowCheckpoint)
  • Hash chain: SHA-256 chained audit trail for tamper detection
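The tamper-detection idea behind the hash chain can be shown in a few lines. The production schema lives in models.py; this stand-in only demonstrates the chaining rule, in which each record's hash binds to its predecessor's, so editing any record invalidates every later hash:

```python
import hashlib
import json

def chain_hash(prev_hash: str, record: dict) -> str:
    """Hash of this record bound to the previous record's hash."""
    payload = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()

def verify_chain(records: list, hashes: list, genesis: str = "0" * 64) -> bool:
    """Recompute the chain; any edited record breaks all later hashes."""
    prev = genesis
    for record, expected in zip(records, hashes):
        prev = chain_hash(prev, record)
        if prev != expected:
            return False
    return True
```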

10. Production Checklist

  • [ ] Configuration validated: aegis validate-config <config.yaml>
  • [ ] Metrics endpoint accessible: curl http://localhost:9090/metrics
  • [ ] Prometheus scraping confirmed: Check Prometheus targets page
  • [ ] Alert rules loaded: Verify in Prometheus UI (Status > Rules)
  • [ ] Grafana dashboards imported: Overview + Risk Analysis
  • [ ] RBAC roles configured: schema/rbac-definitions.yaml reviewed
  • [ ] Audit trail persistence configured: DATABASE_URL set (if using persistence)
  • [ ] Health check passing: aegis health or aegis version
  • [ ] HSM keys provisioned: (if PQ-hardened profile — see section 8)
  • [ ] DR failover tested: (if multi-region — see section 9)
  • [ ] Quality gates green: ruff check src/ && black --check src/ && mypy src/ && bandit -c pyproject.toml -r src/ && pytest tests/ -v
  • [ ] Telemetry URLs use HTTPS: All telemetry_url values use https:// (enforced by default)
  • [ ] No secrets in deployment config: Verify all credentials use environment variables or secret stores

References