Disaster Recovery Assessment

Version: 1.0.0 | Date: 2026-02-06 | Status: Phase 1 Complete


1. Current Architecture

AEGIS uses a checkpoint-based persistence model with the following components:

| Component | Implementation | Location |
|---|---|---|
| State Storage | SQLAlchemy async ORM | src/workflows/persistence/repository.py |
| Checkpoint Model | SHA-256 hashed snapshots | src/workflows/persistence/models.py |
| Audit Trail | Hash-chained transitions | WorkflowTransition.compute_hash() |
| Recovery Engine | DurableWorkflowEngine | src/workflows/persistence/durable.py |
| Health Check | aegis health CLI | src/cli.py |

Supported Databases

  • SQLite (development/testing): In-memory or file-based
  • PostgreSQL (production): Via asyncpg driver
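
Both databases are addressed through SQLAlchemy async connection URLs. A minimal sketch of the two URL forms (host, credentials, and file names here are illustrative, not the project's actual configuration):

```python
# SQLAlchemy async connection URLs for the two supported databases.
# Values are illustrative; substitute your own host, credentials, and paths.
SQLITE_MEMORY = "sqlite+aiosqlite:///:memory:"   # development: in-memory, lost on exit
SQLITE_FILE = "sqlite+aiosqlite:///./aegis.db"   # development: file-based
POSTGRES = "postgresql+asyncpg://aegis:secret@db-host:5432/aegis"  # production via asyncpg
```

Note that the in-memory SQLite URL provides no durability at all; it corresponds to the "no persistence configured" RPO scenario in the next section.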

2. Recovery Point Objective (RPO)

RPO = Time since last checkpoint

| Scenario | RPO | Notes |
|---|---|---|
| Auto-checkpoint on transition | ~0 seconds | State saved on every transition |
| Manual checkpoint | Configurable | Recommend <= 5 minutes |
| No persistence configured | N/A (in-memory only) | Data lost on process termination |

Recommendation

Enable checkpoint_on_transition=True (the default) for a near-zero RPO on state transitions. For long-running workflows that spend significant time between transitions, add periodic checkpoints at 5-minute intervals or less.
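
The periodic-checkpoint recommendation can be sketched as a background task. This is a sketch only: engine.checkpoint() and workflow.is_complete are illustrative names standing in for whatever the engine actually exposes.

```python
import asyncio

async def periodic_checkpoint(engine, workflow, interval_seconds=300):
    # Checkpoint every interval_seconds (default 5 minutes) until the
    # workflow reaches a terminal state. engine.checkpoint() is an
    # illustrative name for the engine's checkpoint call.
    while not workflow.is_complete:
        await asyncio.sleep(interval_seconds)
        await engine.checkpoint(workflow)
```

In practice this would run alongside the workflow (e.g. via asyncio.create_task) and be cancelled once the workflow completes.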


3. Recovery Time Objective (RTO)

RTO = Process restart + resume_all_pending() execution time

| Component | Estimated Time | Notes |
|---|---|---|
| Process restart | < 5 seconds | Application startup |
| Database reconnection | < 2 seconds | Connection pool initialization |
| resume_all_pending() | < 30 seconds | For up to 100 workflows |
| Health verification | < 5 seconds | aegis health check |
| Total RTO | < 60 seconds | Single-instance recovery |

Verification

```shell
# Verify system health after recovery
aegis health
```

```python
# Resume pending workflows (programmatic)
engine = DurableWorkflowEngine(persistence)
workflows = await engine.resume_all_pending(ProposalWorkflow)
```

4. Integrity Guarantees

Hash Chain Verification

Every state transition is recorded with a SHA-256 hash chain:

```
Transition N: hash(workflow_id + from_state + to_state + actor + timestamp + previous_hash)
```

Tamper detection: If any transition record is modified, verify_audit_chain() will detect the broken chain and report the specific transition.
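
The chain computation and verification reduce to the following sketch (field names and record layout are illustrative; the real implementation lives in WorkflowTransition.compute_hash()):

```python
import hashlib

def compute_hash(workflow_id, from_state, to_state, actor, timestamp, previous_hash):
    # Fold the transition fields and the previous link into one SHA-256
    # digest, per the formula above.
    payload = f"{workflow_id}{from_state}{to_state}{actor}{timestamp}{previous_hash}"
    return hashlib.sha256(payload.encode()).hexdigest()

def verify_chain(transitions):
    # Recompute every link in order; report the index of the first broken one.
    previous_hash = ""
    for i, t in enumerate(transitions):
        expected = compute_hash(t["workflow_id"], t["from_state"], t["to_state"],
                                t["actor"], t["timestamp"], previous_hash)
        if t["hash"] != expected:
            return False, i
        previous_hash = expected
    return True, None
```

Because each hash folds in the previous one, modifying any record invalidates every link from that point forward, which is what lets verification pinpoint the specific transition.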

Checkpoint Integrity

Each checkpoint stores a SHA-256 hash of the serialized state snapshot. On restore, the hash can be recomputed and compared.
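
The restore-time check is a straightforward recompute-and-compare. A minimal sketch, assuming the snapshot is a JSON-serializable dict (the actual serialization format belongs to the checkpoint model):

```python
import hashlib
import json

def snapshot_hash(state):
    # sort_keys gives a deterministic serialization, so the same logical
    # state always produces the same hash.
    return hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()

def verify_snapshot(state, stored_hash):
    # On restore: recompute the hash of the deserialized state and
    # compare it with the hash stored alongside the checkpoint.
    return snapshot_hash(state) == stored_hash
```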

Verification Commands

```shell
# CLI health check (includes chain verification when persistence is available)
aegis health
```

```python
# Programmatic verification
valid, error = await engine.verify_integrity(workflow_id)
assert valid, f"Chain broken: {error}"
```

5. Tested Failure Scenarios

The following scenarios are covered by integration tests (tests/integration/test_dr_recovery.py):

| Test | Scenario | Verified |
|---|---|---|
| Checkpoint-Resume | Create, checkpoint, discard, resume | State matches |
| Post-Transition Resume | Transition, checkpoint, resume | Correct state restored |
| Hash Chain Valid | Multiple transitions | Chain verification passes |
| Batch Recovery | 5 workflows, resume all | All recovered |
| Completed Exclusion | Completed workflows | Not in pending list |
| Mixed States | Workflows at different stages | All resume correctly |
| Time-Travel | Load historical checkpoint | Correct historical state |
| Export/Import | to_dict() / from_dict() round-trip | State preserved |
| Audit Trail | Multiple transitions | All recorded with hashes |
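
The first scenario (Checkpoint-Resume) boils down to the following pattern, shown here against an illustrative in-memory store rather than the actual SQLAlchemy repository:

```python
import copy

class InMemoryStore:
    # Illustrative stand-in for the persistence repository.
    def __init__(self):
        self.checkpoints = {}

    def save(self, workflow_id, state):
        # deepcopy so later mutations of the live state cannot alter the checkpoint
        self.checkpoints[workflow_id] = copy.deepcopy(state)

    def load(self, workflow_id):
        return copy.deepcopy(self.checkpoints[workflow_id])

store = InMemoryStore()
state = {"stage": "review", "votes": 3}
store.save("wf-1", state)       # create + checkpoint
del state                       # discard the in-memory workflow
restored = store.load("wf-1")   # resume from the checkpoint
assert restored == {"stage": "review", "votes": 3}  # state matches
```

The integration tests exercise this same pattern end-to-end through the real engine and database.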

6. Current Limitations

| Limitation | Impact | Mitigation Path |
|---|---|---|
| Single-instance only | No automatic failover | Add async replication (Phase 2) |
| No real-time replication | Potential data loss if DB fails | Use PostgreSQL with streaming replication |
| No multi-region | Single-region dependency | Deploy read replicas in secondary region |
| No point-in-time recovery | Cannot restore to arbitrary timestamp | Use PostgreSQL PITR with WAL archiving |
| No automated backup | Manual backup required | Add scheduled pg_dump or WAL archiving |

7. Phase 2 Upgrade Path

When deployment infrastructure is available:

  1. PostgreSQL streaming replication for hot standby
  2. WAL archiving for point-in-time recovery
  3. Scheduled backups via pg_dump or cloud provider snapshots
  4. Multi-region read replicas for geographic redundancy
  5. Automated failover via Patroni or cloud provider HA
  6. Live DR drill simulating region outage with documented runbook