AEGIS Production Deployment Guide¶
Version: 1.0.0 | Updated: 2026-02-09 | Status: Active
This guide covers deploying AEGIS in production environments including Docker, Kubernetes, and AWS.
1. Prerequisites¶
- Python: 3.9+ (3.11 recommended for production)
- pip: 21.0+
- OS: Linux (Ubuntu 22.04+, Amazon Linux 2023) or macOS
- Optional: Docker 24+, Kubernetes 1.28+, PostgreSQL 15+
Deployment Profiles¶
| Profile | Install Command | Use Case |
|---|---|---|
| Minimal | pip install aegis-governance | Evaluation only, zero dependencies |
| Standard | pip install aegis-governance[engine,telemetry] | Production with metrics + scipy z-scores |
| Full | pip install aegis-governance[all] | All features including crypto + persistence |
| PQ-Hardened | pip install aegis-governance[crypto,pqc,persistence] | Post-quantum signatures + durable state |
2. Installation Profiles¶
Minimal (Zero Dependencies)¶
Provides: pcw_decide(), CLI (aegis evaluate), gate evaluation, Bayesian posterior. No scipy (z-scores unavailable), no Prometheus metrics, no YAML config loading.
Standard (Recommended for Production)¶
Adds:

- engine: scipy for utility z-score computation
- telemetry: prometheus_client for the Prometheus metrics exporter
- config: pyyaml for YAML configuration loading
Full¶
Adds all optional groups: engine, telemetry, config, mcp, crypto, pqc, persistence.
Optional Dependency Groups¶
| Group | Package(s) | Purpose |
|---|---|---|
| engine | scipy | Utility z-score computation |
| telemetry | prometheus_client | Prometheus metrics exporter |
| config | pyyaml | YAML configuration loading |
| mcp | pyyaml | MCP server for AI agents |
| crypto | btclib, coincurve | BIP-340 Schnorr signatures |
| pqc | liboqs-python | ML-DSA-44, ML-KEM-768 (requires native liboqs) |
| persistence | sqlalchemy, asyncpg, aiosqlite | Durable workflow state |
3. Configuration¶
Default Configuration¶
```python
from aegis_governance import AegisConfig

# Uses frozen defaults matching schema/interface-contract.yaml
config = AegisConfig.default()
```
YAML Configuration¶
Example config.yaml:
```yaml
parameters:
  epsilon_R: 0.01
  epsilon_P: 0.01
  risk_trigger_factor: 2.0
  profit_trigger_factor: 2.0
  trigger_confidence_prob: 0.95
  novelty_gate:
    N0: 0.7
    k: 10.0
    output_threshold: 0.8
  complexity_floor: 0.5
  quality_min_score: 0.7
```
Dict Configuration¶
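A configuration dict mirrors the YAML structure above. The sketch below is illustrative: the dict layout follows config.yaml, but the deep_merge helper is a generic stand-in, not part of aegis_governance.

```python
# Illustrative only: the dict mirrors the YAML structure shown above.
# deep_merge is a generic sketch, not an aegis_governance API.
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively overlay `override` onto `base` without mutating either."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {
    "parameters": {
        "epsilon_R": 0.01,
        "epsilon_P": 0.01,
        "risk_trigger_factor": 2.0,
        "profit_trigger_factor": 2.0,
        "trigger_confidence_prob": 0.95,
    }
}

# Overlay a single parameter for experimentation; production changes to
# frozen parameters require formal recalibration approval (see below).
config_dict = deep_merge(defaults, {"parameters": {"risk_trigger_factor": 3.0}})
```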
Environment Variables¶
| Variable | Purpose | Default |
|---|---|---|
| AEGIS_CONFIG_PATH | Path to YAML config file | None (uses defaults) |
| AEGIS_METRICS_PORT | Metrics server port | 9090 |
| AEGIS_LOG_LEVEL | Logging level | INFO |
| DATABASE_URL | PostgreSQL connection string | None (in-memory) |
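The resolution of these variables can be sketched as follows; load_env_settings is an illustrative helper (not a library function), using the names and defaults from the table above.

```python
import os

# Sketch: resolve the documented environment variables with their defaults.
# Variable names come from the table above; the helper itself is illustrative.
def load_env_settings(env=None):
    env = os.environ if env is None else env
    return {
        "config_path": env.get("AEGIS_CONFIG_PATH"),      # None -> frozen defaults
        "metrics_port": int(env.get("AEGIS_METRICS_PORT", "9090")),
        "log_level": env.get("AEGIS_LOG_LEVEL", "INFO"),
        "database_url": env.get("DATABASE_URL"),          # None -> in-memory
    }
```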
Frozen Parameter Policy¶
schema/interface-contract.yaml is the authoritative source for parameter values. AegisConfig defaults match this file exactly. Runtime mutation is impossible (frozen dataclass). Parameter changes require formal recalibration approval.
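The frozen-dataclass behavior can be illustrated with a plain Python example; FrozenParams below is a demonstration stand-in (a subset of fields), not the real AegisConfig.

```python
from dataclasses import dataclass, FrozenInstanceError

# Illustrative model of the frozen-parameter policy: a frozen dataclass
# rejects attribute assignment at runtime. AegisConfig follows this pattern;
# the fields here are a demonstration subset.
@dataclass(frozen=True)
class FrozenParams:
    epsilon_R: float = 0.01
    epsilon_P: float = 0.01
    risk_trigger_factor: float = 2.0

params = FrozenParams()
try:
    params.epsilon_R = 0.5  # Any mutation attempt raises FrozenInstanceError
except FrozenInstanceError:
    mutation_blocked = True
```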
4. Docker Deployment¶
Dockerfile¶
The repository includes a multi-stage Dockerfile:
Key features:

- Multi-stage build (smaller image)
- Non-root user (aegis, UID 1000)
- Health check via aegis version
- Exposes port 9090 for metrics
Docker Compose¶
Start AEGIS with Prometheus and Grafana:
Services:
| Service | Port | Purpose |
|---|---|---|
| aegis | 9090 | AEGIS metrics server |
| prometheus | 9091 | Prometheus monitoring |
| grafana | 3000 | Grafana dashboards |
Environment Variables (Docker)¶
| Variable | Default | Purpose |
|---|---|---|
| AEGIS_CONFIG_PATH | /app/schema/interface-contract.yaml | Config file path |
| AEGIS_METRICS_PORT | 9090 | Metrics endpoint port |
| DATABASE_URL | None | PostgreSQL connection string |
Volume Mounts¶
| Container Path | Purpose |
|---|---|
| /app/schema/ | Configuration schemas (read-only) |
| /app/config/ | Custom configuration (optional) |
5. Kubernetes Deployment¶
Deployment¶
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aegis-governance
  labels:
    app: aegis-governance
spec:
  replicas: 2
  selector:
    matchLabels:
      app: aegis-governance
  template:
    metadata:
      labels:
        app: aegis-governance
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
      containers:
        - name: aegis
          image: aegis-governance:latest
          command: ["python", "-c", "import signal; from telemetry.metrics_server import MetricsServer; s=MetricsServer(host='0.0.0.0',port=9090); s.start(); signal.sigwait({signal.SIGTERM,signal.SIGINT}); s.stop()"]
          ports:
            - containerPort: 9090
              name: metrics
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "256Mi"
          livenessProbe:
            exec:
              command: ["aegis", "version"]
            initialDelaySeconds: 5
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /metrics
              port: 9090
            initialDelaySeconds: 5
            periodSeconds: 10
          env:
            - name: AEGIS_CONFIG_PATH
              value: "/app/config/config.yaml"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: aegis-secrets
                  key: database-url
                  optional: true
          volumeMounts:
            - name: config
              mountPath: /app/config
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: aegis-config
```
ConfigMap¶
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aegis-config
data:
  config.yaml: |
    parameters:
      epsilon_R: 0.01
      epsilon_P: 0.01
      risk_trigger_factor: 2.0
      profit_trigger_factor: 2.0
      trigger_confidence_prob: 0.95
```
Secret¶
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: aegis-secrets
type: Opaque
stringData:
  database-url: "postgresql+asyncpg://{USER}:{PASSWORD}@{HOST}:5432/aegis"
  # HSM credentials (if PQ-hardened profile)
  hsm-pin: "{HSM_PIN}"
```
Service + ServiceMonitor¶
```yaml
apiVersion: v1
kind: Service
metadata:
  name: aegis-governance
  labels:
    app: aegis-governance
spec:
  selector:
    app: aegis-governance
  ports:
    - name: metrics
      port: 9090
      targetPort: 9090
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: aegis-governance
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: aegis-governance
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
```
6. AWS Deployment¶
Lambda (Single Evaluation)¶
Use Lambda for on-demand proposal evaluation:
```python
import json

from aegis_governance import AegisConfig, PCWContext, PCWPhase, pcw_decide

config = AegisConfig.default()
evaluator = config.create_gate_evaluator()


def handler(event, context):
    ctx = PCWContext(
        agent_id=event.get("agent_id", "lambda"),
        session_id=context.aws_request_id,
        phase=PCWPhase.PLAN,
        proposal_summary=event["proposal_summary"],
        estimated_impact=event.get("estimated_impact", "medium"),
        risk_proposed=event.get("risk_score", 0.1),
        complexity_score=event.get("complexity_score", 0.5),
        quality_score=event.get("quality_score", 0.8),
    )
    decision = pcw_decide(ctx, gate_evaluator=evaluator)
    return {
        "statusCode": 200,
        "body": json.dumps({
            "status": decision.status.value,
            "rationale": decision.rationale,
            "decision_id": decision.decision_id,
        }),
    }
```
Package with: pip install aegis-governance[engine] -t ./package
ECS/Fargate (Metrics Server)¶
For long-running metrics collection:
```json
{
  "family": "aegis-governance",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512",
  "containerDefinitions": [
    {
      "name": "aegis",
      "image": "aegis-governance:latest",
      "command": ["python", "-c", "import signal; from telemetry.metrics_server import MetricsServer; s=MetricsServer(host='0.0.0.0',port=9090); s.start(); signal.sigwait({signal.SIGTERM,signal.SIGINT}); s.stop()"],
      "portMappings": [
        { "containerPort": 9090, "protocol": "tcp" }
      ],
      "environment": [
        { "name": "AEGIS_METRICS_PORT", "value": "9090" }
      ],
      "healthCheck": {
        "command": ["CMD-SHELL", "aegis version || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3
      }
    }
  ]
}
```
DynamoDB (Audit Trail)¶
For serverless audit trail storage, implement a DynamoDB-backed repository following the WorkflowPersistence protocol in src/workflows/persistence/repository.py. The ORM models in src/workflows/persistence/models.py define the schema (WorkflowInstance, WorkflowTransition, WorkflowCheckpoint).
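One way such a repository might lay out items in a single DynamoDB table is sketched below. The key names (pk/sk) and attributes are illustrative assumptions, not a shipped schema; actual writes would go through boto3's Table.put_item, behind an implementation of the WorkflowPersistence protocol.

```python
# Sketch of a single-table DynamoDB item layout for workflow audit records.
# Key names and attributes are illustrative assumptions; real writes would use
# boto3's Table.put_item behind a repository implementing the
# WorkflowPersistence protocol from src/workflows/persistence/repository.py.
def transition_item(workflow_id, seq, from_state, to_state, payload):
    return {
        "pk": f"WF#{workflow_id}",   # Partition key: one workflow per partition
        "sk": f"TR#{seq:08d}",       # Sort key: zero-padded for ordered queries
        "type": "WorkflowTransition",
        "from_state": from_state,
        "to_state": to_state,
        "payload": payload,
    }

item = transition_item("wf-123", 1, "created", "approved", {"actor": "agent-7"})
```

Querying `pk = "WF#wf-123"` with a `begins_with(sk, "TR#")` condition would then return the full transition history in order.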
7. Observability Setup¶
Metrics Endpoint¶
Start the built-in metrics server:
```python
# Programmatic (persistent server in background thread)
from telemetry.metrics_server import MetricsServer

server = MetricsServer(port=9090)
server.start()
```
To dump current metrics once (useful for debugging), fetch the endpoint directly: `curl -s http://localhost:9090/metrics`.
Available Metrics¶
| Metric | Type | Description |
|---|---|---|
| aegis_decisions_total | Counter | Decision outcomes by gate type |
| aegis_gates_evaluated_total | Counter | Gate evaluations performed |
| aegis_latency_seconds | Histogram | Operation latency |
| aegis_decision_latency_seconds | Histogram | End-to-end decision latency |
| aegis_proposals_total | Counter | Proposals by state |
| aegis_active_proposals | Gauge | Currently active proposals |
| aegis_kl_divergence | Gauge | Current KL divergence |
| aegis_override_requests_total | Counter | Override requests by outcome |
| aegis_errors_total | Counter | Errors by component |
Prometheus Setup¶
- Copy recording rules: monitoring/prometheus/recording-rules.yaml
- Copy alerting rules: monitoring/prometheus/alerting-rules.yaml
- Configure scrape target: http://aegis:9090/metrics
Pre-computed recording rules:

- aegis:gate_pass_rate_5m — Gate pass rate over 5 minutes
- aegis:decision_rate_5m — Decision rate by status
- aegis:p99_latency_5m — p99 latency by operation
- aegis:override_rate_1h — Override request rate
- aegis:error_rate_5m — Error rate by component
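As an illustration, a recording rule for the first of these might look like the following; the PromQL expression and the `result` label are assumptions, and the shipped monitoring/prometheus/recording-rules.yaml remains authoritative.

```yaml
groups:
  - name: aegis.recording
    rules:
      # Illustrative expression only; the shipped
      # monitoring/prometheus/recording-rules.yaml is authoritative.
      - record: aegis:gate_pass_rate_5m
        expr: |
          sum(rate(aegis_gates_evaluated_total{result="pass"}[5m]))
            /
          sum(rate(aegis_gates_evaluated_total[5m]))
```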
Alerting Rules¶
| Alert | Condition | Severity |
|---|---|---|
| AegisHighGateFailRate | Gate pass rate < 50% for 10m | Warning |
| AegisHighLatency | p99 latency > 1s for 5m | Warning |
| AegisOverrideSpike | Override rate > 0.1/min for 15m | Critical |
| AegisDriftCritical | KL divergence status = critical for 5m | Critical |
| AegisOverrideStalePartial | Partial override stuck > 2h | Warning |
| AegisErrorRate | Error rate > 5% for 5m | Warning |
Grafana Dashboards¶
Import from monitoring/grafana/:

- overview-dashboard.json — AEGIS system overview (decisions, gates, latency)
- risk-analysis-dashboard.json — Risk analytics (drift, Bayesian, overrides)
HTTP Telemetry Sink¶
Stream telemetry events to a remote collector via HTTP POST.
CLI:
```shell
echo '{"risk_proposed": 0.3, "profit_proposed": 0.1}' | \
  aegis evaluate --telemetry-url https://yd1xm4ahcg.execute-api.us-west-2.amazonaws.com/dev/v1/events
```
YAML Configuration (aegis.yaml):
Programmatic (per-event):
```python
from aegis_governance import HTTPEventSink, TelemetryEmitter, pcw_decide

emitter = TelemetryEmitter(source="my-service")
emitter.add_sink(HTTPEventSink(
    url="https://yd1xm4ahcg.execute-api.us-west-2.amazonaws.com/dev/v1/events",
    headers={"Authorization": "Bearer ${TOKEN}"},
    timeout=10,
))

result = pcw_decide(context, telemetry_emitter=emitter)
```
Programmatic (batched — recommended for production):
```python
from aegis_governance import http_sink

sink = http_sink(
    "https://yd1xm4ahcg.execute-api.us-west-2.amazonaws.com/dev/v1/events",
    batch_size=100,
    flush_interval_seconds=60,
)
sink.start()  # Background flush thread
emitter.add_sink(sink)

# ... on shutdown:
sink.stop()  # Drains remaining events
```
Production notes:

- Use HTTPS with authentication headers (Authorization: Bearer ... or X-API-Key: ...)
- BatchHTTPSink buffers events and flushes as JSON arrays (bounded: maxlen=batch_size*10)
- Failures are logged but never propagated — telemetry must not crash the producer
- Background flush thread is a daemon and sleeps in 1-second increments for responsive shutdown
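The batching behavior described in these notes can be sketched as a miniature sink; this is an illustration of the pattern, not the library's BatchHTTPSink, and `transport` stands in for the HTTP POST call.

```python
import threading
from collections import deque

# Miniature of the batching pattern described above. Illustrative only --
# not the library's BatchHTTPSink. `transport` stands in for the HTTP POST.
class MiniBatchSink:
    def __init__(self, transport, batch_size=100):
        self.transport = transport
        self.buffer = deque(maxlen=batch_size * 10)  # Bounded: drops oldest on overflow
        self.lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def emit(self, event):
        with self.lock:
            self.buffer.append(event)

    def flush(self):
        with self.lock:
            batch = list(self.buffer)
            self.buffer.clear()
        if batch:
            try:
                self.transport(batch)  # JSON-array POST in the real sink
            except Exception:
                pass  # Telemetry failures must never crash the producer

    def _run(self):
        # Sleep in 1-second increments so stop() stays responsive
        while not self._stop.wait(1.0):
            self.flush()

    def stop(self):
        self._stop.set()
        self._thread.join(timeout=5)
        self.flush()  # Drain remaining events
```

Note how `stop()` performs a final `flush()` after joining the thread, which is why shutdown drains buffered events rather than dropping them.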
TLS Requirement (CoSAI MCP-T7)¶
HTTPEventSink and BatchHTTPSink require https:// URLs by default. Governance telemetry contains decision rationale, gate results, and drift metrics — transmitting this over plaintext risks interception and tampering.
| Component | Enforcement | Escape Hatch |
|---|---|---|
| HTTPEventSink | Rejects http:// URLs | allow_insecure=True |
| BatchHTTPSink | Rejects http:// URLs | allow_insecure=True |
| http_sink() factory | Rejects http:// URLs | allow_insecure=True |
| MCP telemetry_url param | Rejects http:// scheme | None (network-facing) |
| CLI --telemetry-url | Warns and disables on http:// | None |
Local development (when you need http://):
```python
from telemetry.emitter import HTTPEventSink

# Explicitly opt into insecure transport for local development
sink = HTTPEventSink(
    url="http://localhost:9090/events",
    allow_insecure=True,  # Logs a WARNING for audit trail
)
```
MCP server: The MCP server (aegis-mcp-server) does not accept http:// telemetry URLs regardless of binding. This is defense-in-depth — a network-accessible endpoint should never relay telemetry over plaintext.
Reference: CoSAI MCP Security Taxonomy, MCP-T7 (Session and Transport Security Failures) recommends TLS 1.2+ for all MCP transport channels.
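The enforcement pattern in the table above can be sketched as a scheme check; validate_telemetry_url below is an illustration of the behavior, not the library's internal implementation.

```python
import logging
from urllib.parse import urlparse

logger = logging.getLogger("telemetry")

# Sketch of the TLS enforcement described above: https passes, http passes
# only with an explicit (and logged) opt-out, everything else is rejected.
def validate_telemetry_url(url, allow_insecure=False):
    scheme = urlparse(url).scheme
    if scheme == "https":
        return url
    if scheme == "http" and allow_insecure:
        logger.warning("Insecure telemetry transport enabled for %s", url)
        return url
    raise ValueError(f"Telemetry URL must use https:// (got {scheme}://)")
```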
8. HSM Integration¶
Status: Placeholder — requires crypto provider extension (future work)
Architecture¶
The AEGIS two-key override mechanism requires Ed25519 + ML-DSA-44 hybrid signatures. In production environments, private keys should be stored in a Hardware Security Module (HSM).
Supported HSMs (Planned)¶
| HSM | Interface | Notes |
|---|---|---|
| AWS CloudHSM | PKCS#11 | Production recommended |
| YubiHSM 2 | PKCS#11 | On-premises |
| SoftHSM2 | PKCS#11 | Development/testing only |
Key Custody¶
- Ed25519 signing keys: Generated and stored IN the HSM
- ML-DSA-44 keys: Generated and stored IN the HSM (when PQ-hardened)
- Key Encryption Keys (KEK): See src/crypto/kek_provider.py and scripts/generate_master_kek.py
- Two-key requirement: Both override signers must have independent HSM access
Thread Safety¶
HSM sessions should use a connection pool pattern (one session per thread). The AEGIS crypto providers are already thread-safe via threading.Lock.
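The one-session-per-thread pattern can be sketched as follows; ThreadLocalSessionPool is a generic illustration (not an AEGIS class), and `open_session` stands in for a PKCS#11 session factory.

```python
import threading

# Sketch of the one-session-per-thread pattern described above.
# `open_session` stands in for a PKCS#11 session factory; the pool is generic.
class ThreadLocalSessionPool:
    def __init__(self, open_session):
        self._open = open_session
        self._local = threading.local()

    def session(self):
        # Each thread lazily opens and caches its own HSM session,
        # so PKCS#11 handles are never shared across threads.
        if not hasattr(self._local, "session"):
            self._local.session = self._open()
        return self._local.session
```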
9. Multi-Region DR¶
See Disaster Recovery Assessment for full details.
Strategy¶
Active-passive with audit trail replication:
| Component | Primary | DR |
|---|---|---|
| AEGIS Process | Active | Standby |
| PostgreSQL | Primary | Streaming replica |
| Audit Trail | Write-ahead | Replicated |
RPO / RTO Targets¶
| Scenario | RPO | RTO |
|---|---|---|
| Auto-checkpoint on transition | ~0 seconds | < 60 seconds |
| Manual checkpoint | <= 5 minutes | < 60 seconds |
| No persistence configured | N/A (in-memory) | Restart only |
Recovery Procedure¶
```shell
# 1. Verify system health
aegis health
```

```python
# 2. Resume pending workflows (programmatic)
from workflows.persistence.durable import DurableWorkflowEngine

engine = DurableWorkflowEngine(session_factory)
await engine.resume_all_pending()
```
Persistence Layer¶
- Production: PostgreSQL with streaming replication
- Development: SQLite (in-memory or file-based)
- Schema: src/workflows/persistence/models.py (WorkflowInstance, WorkflowTransition, WorkflowCheckpoint)
- Hash chain: SHA-256 chained audit trail for tamper detection
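The hash-chain idea can be sketched in a few lines; the record fields and genesis value below are illustrative, not the persistence layer's actual wire format.

```python
import hashlib
import json

# Sketch of SHA-256 hash chaining for tamper detection, as described above.
# Record fields and the genesis value are illustrative.
def chain_hash(prev_hash, record):
    body = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + body).hexdigest()

def verify_chain(records, genesis="0" * 64):
    """Recompute the chain; editing any record breaks every later hash."""
    prev = genesis
    for rec in records:
        expected = chain_hash(prev, rec["data"])
        if rec["hash"] != expected:
            return False
        prev = expected
    return True
```

Because each hash covers its predecessor, a tampered transition invalidates the entire suffix of the trail, which is what makes after-the-fact edits detectable.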
10. Production Checklist¶
- [ ] Configuration validated: aegis validate-config <config.yaml>
- [ ] Metrics endpoint accessible: curl http://localhost:9090/metrics
- [ ] Prometheus scraping confirmed: Check Prometheus targets page
- [ ] Alert rules loaded: Verify in Prometheus UI (Status > Rules)
- [ ] Grafana dashboards imported: Overview + Risk Analysis
- [ ] RBAC roles configured: schema/rbac-definitions.yaml reviewed
- [ ] Audit trail persistence configured: DATABASE_URL set (if using persistence)
- [ ] Health check passing: aegis health or aegis version
- [ ] HSM keys provisioned: (if PQ-hardened profile — see section 8)
- [ ] DR failover tested: (if multi-region — see section 9)
- [ ] Quality gates green: ruff check src/ && black --check src/ && mypy src/ && bandit -c pyproject.toml -r src/ && pytest tests/ -v
- [ ] Telemetry URLs use HTTPS: All telemetry_url values use https:// (enforced by default)
- [ ] No secrets in deployment config: Verify all credentials use environment variables or secret stores
References¶
- README.md — Quick start and installation
- monitoring/README.md — Metrics endpoint setup
- DR Assessment — Disaster recovery details
- Interface Contract — Frozen parameters
- Performance SLAs — Latency and throughput targets
- Migration Guide — Upgrade procedures