AEGIS Production Deployment Guide¶
Version: 1.0.0 | Updated: 2026-02-09 | Status: Active
This guide covers deploying AEGIS in production environments including Docker, Kubernetes, and AWS.
1. Prerequisites¶
- Python: 3.9+ (3.11 recommended for production)
- pip: 21.0+
- OS: Linux (Ubuntu 22.04+, Amazon Linux 2023) or macOS
- Optional: Docker 24+, Kubernetes 1.28+, PostgreSQL 15+
Deployment Profiles¶
| Profile | Install Command | Use Case |
|---|---|---|
| Minimal | pip install aegis-governance | Evaluation only, zero dependencies |
| Standard | pip install aegis-governance[engine,telemetry] | Production with metrics + scipy z-scores |
| Full | pip install aegis-governance[all] | All features including crypto + persistence |
| PQ-Hardened | pip install aegis-governance[crypto,pqc,persistence] | Post-quantum signatures + durable state |
2. Installation Profiles¶
Minimal (Zero Dependencies)¶
Provides: pcw_decide(), CLI (aegis evaluate), gate evaluation, Bayesian posterior. No scipy (z-scores unavailable), no Prometheus metrics, no YAML config loading.
Standard (Recommended for Production)¶
Adds:

- engine: scipy for utility z-score computation
- telemetry: prometheus_client for the Prometheus metrics exporter
- config: pyyaml for YAML configuration loading
Full¶
Adds all optional groups: engine, telemetry, config, mcp, crypto, pqc, persistence.
Optional Dependency Groups¶
| Group | Package(s) | Purpose |
|---|---|---|
| engine | scipy | Utility z-score computation |
| telemetry | prometheus_client | Prometheus metrics exporter |
| config | pyyaml | YAML configuration loading |
| mcp | pyyaml | MCP server for AI agents |
| crypto | btclib, coincurve | BIP-340 Schnorr signatures |
| pqc | liboqs-python | ML-DSA-44, ML-KEM-768 (requires native liboqs) |
| persistence | sqlalchemy, asyncpg, aiosqlite | Durable workflow state |
3. Configuration¶
Default Configuration¶
```python
from aegis_governance import AegisConfig

# Uses frozen defaults matching schema/interface-contract.yaml
config = AegisConfig.default()
```
YAML Configuration¶
Example config.yaml:
```yaml
parameters:
  epsilon_R: 0.01
  epsilon_P: 0.01
  risk_trigger_factor: 2.0
  profit_trigger_factor: 2.0
  trigger_confidence_prob: 0.95
  novelty_gate:
    N0: 0.7
    k: 10.0
    output_threshold: 0.8
  complexity_floor: 0.5
  quality_min_score: 0.7
```
Dict Configuration¶
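A configuration dict mirrors the YAML structure above. The sketch below is illustrative: the dict layout follows config.yaml, but the deep_merge helper is a generic stand-in, not part of aegis_governance.

```python
# Illustrative only: the dict mirrors the YAML structure shown above.
# deep_merge is a generic sketch, not an aegis_governance API.
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively overlay `override` onto `base` without mutating either."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {
    "parameters": {
        "epsilon_R": 0.01,
        "epsilon_P": 0.01,
        "risk_trigger_factor": 2.0,
        "profit_trigger_factor": 2.0,
        "trigger_confidence_prob": 0.95,
    }
}

# Overlay a single parameter for experimentation; production changes to
# frozen parameters require formal recalibration approval (see below).
config_dict = deep_merge(defaults, {"parameters": {"risk_trigger_factor": 3.0}})
```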
Environment Variables¶
| Variable | Purpose | Default |
|---|---|---|
| AEGIS_CONFIG_PATH | Path to YAML config file | None (uses defaults) |
| AEGIS_METRICS_PORT | Metrics server port | 9090 |
| AEGIS_LOG_LEVEL | Logging level | INFO |
| DATABASE_URL | PostgreSQL connection string | None (in-memory) |
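The resolution of these variables can be sketched as follows; load_env_settings is an illustrative helper (not a library function), using the names and defaults from the table above.

```python
import os

# Sketch: resolve the documented environment variables with their defaults.
# Variable names come from the table above; the helper itself is illustrative.
def load_env_settings(env=None):
    env = os.environ if env is None else env
    return {
        "config_path": env.get("AEGIS_CONFIG_PATH"),      # None -> frozen defaults
        "metrics_port": int(env.get("AEGIS_METRICS_PORT", "9090")),
        "log_level": env.get("AEGIS_LOG_LEVEL", "INFO"),
        "database_url": env.get("DATABASE_URL"),          # None -> in-memory
    }
```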
Frozen Parameter Policy¶
schema/interface-contract.yaml is the authoritative source for parameter values. AegisConfig defaults match this file exactly. Runtime mutation is impossible (frozen dataclass). Parameter changes require formal recalibration approval.
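The frozen-dataclass behavior can be illustrated with a plain Python example; FrozenParams below is a demonstration stand-in (a subset of fields), not the real AegisConfig.

```python
from dataclasses import dataclass, FrozenInstanceError

# Illustrative model of the frozen-parameter policy: a frozen dataclass
# rejects attribute assignment at runtime. AegisConfig follows this pattern;
# the fields here are a demonstration subset.
@dataclass(frozen=True)
class FrozenParams:
    epsilon_R: float = 0.01
    epsilon_P: float = 0.01
    risk_trigger_factor: float = 2.0

params = FrozenParams()
try:
    params.epsilon_R = 0.5  # Any mutation attempt raises FrozenInstanceError
except FrozenInstanceError:
    mutation_blocked = True
```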
4. Docker Deployment¶
Dockerfile¶
The repository includes a multi-stage Dockerfile:
Key features:

- Multi-stage build (smaller image)
- Non-root user (aegis, UID 1000)
- Health check via aegis version
- Exposes port 9090 for metrics
Docker Compose¶
Start AEGIS with Prometheus and Grafana:
Services:
| Service | Port | Purpose |
|---|---|---|
| aegis | 9090 | AEGIS metrics server |
| prometheus | 9091 | Prometheus monitoring |
| grafana | 3000 | Grafana dashboards |
Environment Variables (Docker)¶
| Variable | Default | Purpose |
|---|---|---|
| AEGIS_CONFIG_PATH | /app/schema/interface-contract.yaml | Config file path |
| AEGIS_METRICS_PORT | 9090 | Metrics endpoint port |
| DATABASE_URL | None | PostgreSQL connection string |
Volume Mounts¶
| Container Path | Purpose |
|---|---|
| /app/schema/ | Configuration schemas (read-only) |
| /app/config/ | Custom configuration (optional) |
5. Kubernetes Deployment¶
Deployment¶
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aegis-governance
  labels:
    app: aegis-governance
spec:
  replicas: 2
  selector:
    matchLabels:
      app: aegis-governance
  template:
    metadata:
      labels:
        app: aegis-governance
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
      containers:
        - name: aegis
          image: aegis-governance:latest
          command: ["python", "-c", "import signal; from telemetry.metrics_server import MetricsServer; s=MetricsServer(host='0.0.0.0',port=9090); s.start(); signal.sigwait({signal.SIGTERM,signal.SIGINT}); s.stop()"]
          ports:
            - containerPort: 9090
              name: metrics
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "256Mi"
          livenessProbe:
            exec:
              command: ["aegis", "version"]
            initialDelaySeconds: 5
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /metrics
              port: 9090
            initialDelaySeconds: 5
            periodSeconds: 10
          env:
            - name: AEGIS_CONFIG_PATH
              value: "/app/config/config.yaml"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: aegis-secrets
                  key: database-url
                  optional: true
          volumeMounts:
            - name: config
              mountPath: /app/config
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: aegis-config
```
ConfigMap¶
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aegis-config
data:
  config.yaml: |
    parameters:
      epsilon_R: 0.01
      epsilon_P: 0.01
      risk_trigger_factor: 2.0
      profit_trigger_factor: 2.0
      trigger_confidence_prob: 0.95
```
Secret¶
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: aegis-secrets
type: Opaque
stringData:
  database-url: "postgresql+asyncpg://{USER}:{PASSWORD}@{HOST}:5432/aegis"
  # HSM credentials (if PQ-hardened profile)
  hsm-pin: "{HSM_PIN}"
```
Service + ServiceMonitor¶
```yaml
apiVersion: v1
kind: Service
metadata:
  name: aegis-governance
  labels:
    app: aegis-governance
spec:
  selector:
    app: aegis-governance
  ports:
    - name: metrics
      port: 9090
      targetPort: 9090
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: aegis-governance
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: aegis-governance
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
```
6. AWS Deployment¶
Lambda (Single Evaluation)¶
Use Lambda for on-demand proposal evaluation:
```python
import json

from aegis_governance import AegisConfig, PCWContext, PCWPhase, pcw_decide

config = AegisConfig.default()
evaluator = config.create_gate_evaluator()


def handler(event, context):
    ctx = PCWContext(
        agent_id=event.get("agent_id", "lambda"),
        session_id=context.aws_request_id,
        phase=PCWPhase.PLAN,
        proposal_summary=event["proposal_summary"],
        estimated_impact=event.get("estimated_impact", "medium"),
        risk_proposed=event.get("risk_score", 0.1),
        complexity_score=event.get("complexity_score", 0.5),
        quality_score=event.get("quality_score", 0.8),
    )
    decision = pcw_decide(ctx, gate_evaluator=evaluator)
    return {
        "statusCode": 200,
        "body": json.dumps({
            "status": decision.status.value,
            "rationale": decision.rationale,
            "decision_id": decision.decision_id,
        }),
    }
```
Package with: pip install aegis-governance[engine] -t ./package
ECS/Fargate (Metrics Server)¶
For long-running metrics collection:
```json
{
  "family": "aegis-governance",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512",
  "containerDefinitions": [
    {
      "name": "aegis",
      "image": "aegis-governance:latest",
      "command": ["python", "-c", "import signal; from telemetry.metrics_server import MetricsServer; s=MetricsServer(host='0.0.0.0',port=9090); s.start(); signal.sigwait({signal.SIGTERM,signal.SIGINT}); s.stop()"],
      "portMappings": [
        { "containerPort": 9090, "protocol": "tcp" }
      ],
      "environment": [
        { "name": "AEGIS_METRICS_PORT", "value": "9090" }
      ],
      "healthCheck": {
        "command": ["CMD-SHELL", "aegis version || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3
      }
    }
  ]
}
```
DynamoDB (Audit Trail)¶
For serverless audit trail storage, implement a DynamoDB-backed repository following the WorkflowPersistence protocol in src/workflows/persistence/repository.py. The ORM models in src/workflows/persistence/models.py define the schema (WorkflowInstance, WorkflowTransition, WorkflowCheckpoint).
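One way such a repository might lay out items in a single DynamoDB table is sketched below. The key names (pk/sk) and attributes are illustrative assumptions, not a shipped schema; actual writes would go through boto3's Table.put_item, behind an implementation of the WorkflowPersistence protocol.

```python
# Sketch of a single-table DynamoDB item layout for workflow audit records.
# Key names and attributes are illustrative assumptions; real writes would use
# boto3's Table.put_item behind a repository implementing the
# WorkflowPersistence protocol from src/workflows/persistence/repository.py.
def transition_item(workflow_id, seq, from_state, to_state, payload):
    return {
        "pk": f"WF#{workflow_id}",   # Partition key: one workflow per partition
        "sk": f"TR#{seq:08d}",       # Sort key: zero-padded for ordered queries
        "type": "WorkflowTransition",
        "from_state": from_state,
        "to_state": to_state,
        "payload": payload,
    }

item = transition_item("wf-123", 1, "created", "approved", {"actor": "agent-7"})
```

Querying `pk = "WF#wf-123"` with a `begins_with(sk, "TR#")` condition would then return the full transition history in order.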
7. Observability Setup¶
Metrics Endpoint¶
Start the built-in metrics server:
```python
# Programmatic (persistent server in background thread)
from telemetry.metrics_server import MetricsServer

server = MetricsServer(port=9090)
server.start()
```
To dump current metrics once (useful for debugging), fetch the endpoint directly: `curl -s http://localhost:9090/metrics`.
Available Metrics¶
| Metric | Type | Description |
|---|---|---|
| aegis_decisions_total | Counter | Decision outcomes by gate type |
| aegis_gates_evaluated_total | Counter | Gate evaluations performed |
| aegis_latency_seconds | Histogram | Operation latency |
| aegis_decision_latency_seconds | Histogram | End-to-end decision latency |
| aegis_proposals_total | Counter | Proposals by state |
| aegis_active_proposals | Gauge | Currently active proposals |
| aegis_kl_divergence | Gauge | Current KL divergence |
| aegis_override_requests_total | Counter | Override requests by outcome |
| aegis_errors_total | Counter | Errors by component |
Prometheus Setup¶
- Copy recording rules: monitoring/prometheus/recording-rules.yaml
- Copy alerting rules: monitoring/prometheus/alerting-rules.yaml
- Configure scrape target: http://aegis:9090/metrics
Pre-computed recording rules:

- aegis:gate_pass_rate_5m — Gate pass rate over 5 minutes
- aegis:decision_rate_5m — Decision rate by status
- aegis:p99_latency_5m — p99 latency by operation
- aegis:override_rate_1h — Override request rate
- aegis:error_rate_5m — Error rate by component
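As an illustration, a recording rule for the first of these might look like the following; the PromQL expression and the `result` label are assumptions, and the shipped monitoring/prometheus/recording-rules.yaml remains authoritative.

```yaml
groups:
  - name: aegis.recording
    rules:
      # Illustrative expression only; the shipped
      # monitoring/prometheus/recording-rules.yaml is authoritative.
      - record: aegis:gate_pass_rate_5m
        expr: |
          sum(rate(aegis_gates_evaluated_total{result="pass"}[5m]))
            /
          sum(rate(aegis_gates_evaluated_total[5m]))
```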
Alerting Rules¶
| Alert | Condition | Severity |
|---|---|---|
| AegisHighGateFailRate | Gate pass rate < 50% for 10m | Warning |
| AegisHighLatency | p99 latency > 1s for 5m | Warning |
| AegisOverrideSpike | Override rate > 0.1/min for 15m | Critical |
| AegisDriftCritical | KL divergence status = critical for 5m | Critical |
| AegisOverrideStalePartial | Partial override stuck > 2h | Warning |
| AegisErrorRate | Error rate > 5% for 5m | Warning |
Grafana Dashboards¶
Import from monitoring/grafana/:

- overview-dashboard.json — AEGIS system overview (decisions, gates, latency)
- risk-analysis-dashboard.json — Risk analytics (drift, Bayesian, overrides)
HTTP Telemetry Sink¶
Stream telemetry events to a remote collector via HTTP POST.
CLI:
```shell
echo '{"risk_proposed": 0.3, "profit_proposed": 0.1}' | \
  aegis evaluate --telemetry-url https://yd1xm4ahcg.execute-api.us-west-2.amazonaws.com/dev/v1/events
```
YAML Configuration (aegis.yaml):
Programmatic (per-event):
```python
from aegis_governance import HTTPEventSink, TelemetryEmitter, pcw_decide

emitter = TelemetryEmitter(source="my-service")
emitter.add_sink(HTTPEventSink(
    url="https://yd1xm4ahcg.execute-api.us-west-2.amazonaws.com/dev/v1/events",
    headers={"Authorization": "Bearer ${TOKEN}"},
    timeout=10,
))

result = pcw_decide(context, telemetry_emitter=emitter)
```
Programmatic (batched — recommended for production):
```python
from aegis_governance import http_sink

sink = http_sink(
    "https://yd1xm4ahcg.execute-api.us-west-2.amazonaws.com/dev/v1/events",
    batch_size=100,
    flush_interval_seconds=60,
)
sink.start()  # Background flush thread
emitter.add_sink(sink)

# ... on shutdown:
sink.stop()  # Drains remaining events
```
Production notes:

- Use HTTPS with authentication headers (Authorization: Bearer ... or X-API-Key: ...)
- BatchHTTPSink buffers events and flushes as JSON arrays (bounded: maxlen=batch_size*10)
- Failures are logged but never propagated — telemetry must not crash the producer
- Background flush thread is a daemon and sleeps in 1-second increments for responsive shutdown
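The batching behavior described in these notes can be sketched as a miniature sink; this is an illustration of the pattern, not the library's BatchHTTPSink, and `transport` stands in for the HTTP POST call.

```python
import threading
from collections import deque

# Miniature of the batching pattern described above. Illustrative only --
# not the library's BatchHTTPSink. `transport` stands in for the HTTP POST.
class MiniBatchSink:
    def __init__(self, transport, batch_size=100):
        self.transport = transport
        self.buffer = deque(maxlen=batch_size * 10)  # Bounded: drops oldest on overflow
        self.lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def emit(self, event):
        with self.lock:
            self.buffer.append(event)

    def flush(self):
        with self.lock:
            batch = list(self.buffer)
            self.buffer.clear()
        if batch:
            try:
                self.transport(batch)  # JSON-array POST in the real sink
            except Exception:
                pass  # Telemetry failures must never crash the producer

    def _run(self):
        # Sleep in 1-second increments so stop() stays responsive
        while not self._stop.wait(1.0):
            self.flush()

    def stop(self):
        self._stop.set()
        self._thread.join(timeout=5)
        self.flush()  # Drain remaining events
```

Note how `stop()` performs a final `flush()` after joining the thread, which is why shutdown drains buffered events rather than dropping them.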
TLS Requirement (CoSAI MCP-T7)¶
HTTPEventSink and BatchHTTPSink require https:// URLs by default. Governance telemetry contains decision rationale, gate results, and drift metrics — transmitting this over plaintext risks interception and tampering.
| Component | Enforcement | Escape Hatch |
|---|---|---|
| HTTPEventSink | Rejects http:// URLs | allow_insecure=True |
| BatchHTTPSink | Rejects http:// URLs | allow_insecure=True |
| http_sink() factory | Rejects http:// URLs | allow_insecure=True |
| MCP telemetry_url param | Rejects http:// scheme | None (network-facing) |
| CLI --telemetry-url | Warns and disables on http:// | None |
Local development (when you need http://):
```python
from telemetry.emitter import HTTPEventSink

# Explicitly opt into insecure transport for local development
sink = HTTPEventSink(
    url="http://localhost:9090/events",
    allow_insecure=True,  # Logs a WARNING for audit trail
)
```
MCP server: The MCP server (aegis-mcp-server) does not accept http:// telemetry URLs regardless of binding. This is defense-in-depth — a network-accessible endpoint should never relay telemetry over plaintext.
Reference: CoSAI MCP Security Taxonomy, MCP-T7 (Session and Transport Security Failures) recommends TLS 1.2+ for all MCP transport channels.
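The enforcement pattern in the table above can be sketched as a scheme check; validate_telemetry_url below is an illustration of the behavior, not the library's internal implementation.

```python
import logging
from urllib.parse import urlparse

logger = logging.getLogger("telemetry")

# Sketch of the TLS enforcement described above: https passes, http passes
# only with an explicit (and logged) opt-out, everything else is rejected.
def validate_telemetry_url(url, allow_insecure=False):
    scheme = urlparse(url).scheme
    if scheme == "https":
        return url
    if scheme == "http" and allow_insecure:
        logger.warning("Insecure telemetry transport enabled for %s", url)
        return url
    raise ValueError(f"Telemetry URL must use https:// (got {scheme}://)")
```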
8. HSM Integration¶
Status: Placeholder — requires crypto provider extension (future work)
Architecture¶
The AEGIS two-key override mechanism requires Ed25519 + ML-DSA-44 hybrid signatures. In production environments, private keys should be stored in a Hardware Security Module (HSM).
Supported HSMs (Planned)¶
| HSM | Interface | Notes |
|---|---|---|
| AWS CloudHSM | PKCS#11 | Production recommended |
| YubiHSM 2 | PKCS#11 | On-premises |
| SoftHSM2 | PKCS#11 | Development/testing only |
Key Custody¶
- Ed25519 signing keys: Generated and stored IN the HSM
- ML-DSA-44 keys: Generated and stored IN the HSM (when PQ-hardened)
- Key Encryption Keys (KEK): See src/crypto/kek_provider.py and scripts/generate_master_kek.py
- Two-key requirement: Both override signers must have independent HSM access
Thread Safety¶
HSM sessions should use a connection pool pattern (one session per thread). The AEGIS crypto providers are already thread-safe via threading.Lock.
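The one-session-per-thread pattern can be sketched as follows; ThreadLocalSessionPool is a generic illustration (not an AEGIS class), and `open_session` stands in for a PKCS#11 session factory.

```python
import threading

# Sketch of the one-session-per-thread pattern described above.
# `open_session` stands in for a PKCS#11 session factory; the pool is generic.
class ThreadLocalSessionPool:
    def __init__(self, open_session):
        self._open = open_session
        self._local = threading.local()

    def session(self):
        # Each thread lazily opens and caches its own HSM session,
        # so PKCS#11 handles are never shared across threads.
        if not hasattr(self._local, "session"):
            self._local.session = self._open()
        return self._local.session
```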
9. Multi-Region DR¶
See Disaster Recovery Assessment for full details.
Strategy¶
Active-passive with audit trail replication:
| Component | Primary | DR |
|---|---|---|
| AEGIS Process | Active | Standby |
| PostgreSQL | Primary | Streaming replica |
| Audit Trail | Write-ahead | Replicated |
RPO / RTO Targets¶
| Scenario | RPO | RTO |
|---|---|---|
| Auto-checkpoint on transition | ~0 seconds | < 60 seconds |
| Manual checkpoint | <= 5 minutes | < 60 seconds |
| No persistence configured | N/A (in-memory) | Restart only |
Recovery Procedure¶
```shell
# 1. Verify system health
aegis health
```

```python
# 2. Resume pending workflows (programmatic)
from workflows.persistence.durable import DurableWorkflowEngine

engine = DurableWorkflowEngine(session_factory)
await engine.resume_all_pending()
```
Persistence Layer¶
- Production: PostgreSQL with streaming replication
- Development: SQLite (in-memory or file-based)
- Schema: src/workflows/persistence/models.py (WorkflowInstance, WorkflowTransition, WorkflowCheckpoint)
- Hash chain: SHA-256 chained audit trail for tamper detection
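The hash-chain idea can be sketched in a few lines; the record fields and genesis value below are illustrative, not the persistence layer's actual wire format.

```python
import hashlib
import json

# Sketch of SHA-256 hash chaining for tamper detection, as described above.
# Record fields and the genesis value are illustrative.
def chain_hash(prev_hash, record):
    body = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + body).hexdigest()

def verify_chain(records, genesis="0" * 64):
    """Recompute the chain; editing any record breaks every later hash."""
    prev = genesis
    for rec in records:
        expected = chain_hash(prev, rec["data"])
        if rec["hash"] != expected:
            return False
        prev = expected
    return True
```

Because each hash covers its predecessor, a tampered transition invalidates the entire suffix of the trail, which is what makes after-the-fact edits detectable.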
10. Production Checklist¶
- [ ] Configuration validated: aegis validate-config <config.yaml>
- [ ] Metrics endpoint accessible: curl http://localhost:9090/metrics
- [ ] Prometheus scraping confirmed: Check Prometheus targets page
- [ ] Alert rules loaded: Verify in Prometheus UI (Status > Rules)
- [ ] Grafana dashboards imported: Overview + Risk Analysis
- [ ] RBAC roles configured: schema/rbac-definitions.yaml reviewed
- [ ] Audit trail persistence configured: DATABASE_URL set (if using persistence)
- [ ] Health check passing: aegis health or aegis version
- [ ] HSM keys provisioned: (if PQ-hardened profile — see section 8)
- [ ] DR failover tested: (if multi-region — see section 9)
- [ ] Quality gates green: ruff check src/ && black --check src/ && mypy src/ && bandit -c pyproject.toml -r src/ && pytest tests/ -v
- [ ] Telemetry URLs use HTTPS: All telemetry_url values use https:// (enforced by default)
- [ ] No secrets in deployment config: Verify all credentials use environment variables or secret stores
References¶
- README.md — Quick start and installation
- monitoring/README.md — Metrics endpoint setup
- DR Assessment — Disaster recovery details
- Interface Contract — Frozen parameters
- Performance SLAs — Latency and throughput targets
- Migration Guide — Upgrade procedures