Disaster Recovery Assessment
Version: 1.0.0 | Date: 2026-02-06 | Status: Phase 1 Complete
1. Current Architecture
AEGIS uses a checkpoint-based persistence model with the following components:
| Component | Implementation | Location |
|---|---|---|
| State Storage | SQLAlchemy async ORM | src/workflows/persistence/repository.py |
| Checkpoint Model | SHA-256 hashed snapshots | src/workflows/persistence/models.py |
| Audit Trail | Hash-chained transitions | WorkflowTransition.compute_hash() |
| Recovery Engine | DurableWorkflowEngine | src/workflows/persistence/durable.py |
| Health Check | aegis health CLI | src/cli.py |
Supported Databases
- SQLite (development/testing): In-memory or file-based
- PostgreSQL (production): Via asyncpg driver
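As an illustration of the two backends, the sketch below picks a SQLAlchemy-style async connection URL per environment. The function name, host, credentials, and file name are hypothetical, and the `aiosqlite` driver for SQLite is an assumption (the source names only `asyncpg` for PostgreSQL).

```python
def database_url(environment: str) -> str:
    """Return an async SQLAlchemy URL for the given environment (illustrative values)."""
    if environment == "production":
        # PostgreSQL via the asyncpg driver; host and credentials are placeholders.
        return "postgresql+asyncpg://aegis:CHANGE_ME@db.example.internal:5432/aegis"
    if environment == "test":
        # In-memory SQLite; contents vanish when the process exits.
        return "sqlite+aiosqlite:///:memory:"
    # File-based SQLite for local development.
    return "sqlite+aiosqlite:///./aegis-dev.db"
```

Keeping the choice in one function makes it harder to start a production process against an in-memory database by accident.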
2. Recovery Point Objective (RPO)
RPO = Time since last checkpoint
| Scenario | RPO | Notes |
|---|---|---|
| Auto-checkpoint on transition | ~0 seconds | State saved on every transition |
| Manual checkpoint | Configurable | Recommend <= 5 minutes |
| No persistence configured | N/A (in-memory only) | Data lost on process termination |
Recommendation
Enable checkpoint_on_transition=True (the default) for a near-zero RPO on state transitions. For workflows that run for long periods between transitions, add periodic checkpoints at intervals of 5 minutes or less.
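One way to add periodic checkpoints is a background asyncio task that fires on a timer while the workflow runs. This is a minimal sketch, not the AEGIS implementation: `save_checkpoint` and `work` are hypothetical callables supplied by the caller, and the 300-second default mirrors the 5-minute recommendation.

```python
import asyncio
from typing import Awaitable, Callable


async def checkpoint_periodically(
    save_checkpoint: Callable[[], Awaitable[None]],
    interval_seconds: float = 300.0,
) -> None:
    """Call save_checkpoint every interval_seconds until the task is cancelled."""
    while True:
        await asyncio.sleep(interval_seconds)
        await save_checkpoint()


async def run_with_periodic_checkpoints(
    work: Callable[[], Awaitable[None]],
    save_checkpoint: Callable[[], Awaitable[None]],
    interval_seconds: float = 300.0,
) -> None:
    """Run work while a background task checkpoints on a timer."""
    ticker = asyncio.create_task(
        checkpoint_periodically(save_checkpoint, interval_seconds)
    )
    try:
        await work()
    finally:
        # Stop the timer and wait for it to finish unwinding.
        ticker.cancel()
        await asyncio.gather(ticker, return_exceptions=True)
```

With this shape the worst-case RPO between transitions is bounded by roughly `interval_seconds` plus the time one checkpoint takes to write.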
3. Recovery Time Objective (RTO)
RTO = Process restart + resume_all_pending() execution time
| Component | Estimated Time | Notes |
|---|---|---|
| Process restart | < 5 seconds | Application startup |
| Database reconnection | < 2 seconds | Connection pool initialization |
| resume_all_pending() | < 30 seconds | For up to 100 workflows |
| Health verification | < 5 seconds | aegis health check |
| Total RTO | < 60 seconds | Single-instance recovery |
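The component estimates sum to less than the stated total, so the 60-second figure includes some headroom. The arithmetic can be checked directly from the upper bounds in the table; the variable names here are illustrative only.

```python
# Upper bounds from the RTO table, in seconds.
RTO_COMPONENTS = {
    "process_restart": 5,
    "database_reconnection": 2,
    "resume_all_pending": 30,
    "health_verification": 5,
}
RTO_TARGET_SECONDS = 60

worst_case = sum(RTO_COMPONENTS.values())
headroom = RTO_TARGET_SECONDS - worst_case
print(f"worst case: {worst_case}s, headroom: {headroom}s")
# prints "worst case: 42s, headroom: 18s"
```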
Verification
# Verify system health after recovery
aegis health
# Resume pending workflows (programmatic)
engine = DurableWorkflowEngine(persistence)
workflows = await engine.resume_all_pending(ProposalWorkflow)
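The programmatic snippet uses `await`, so it has to run inside a coroutine. A minimal sketch of one way to wrap it and time it against the 30-second estimate follows. `StubEngine` and `StubWorkflow` are hypothetical stand-ins so the example runs on its own; in practice they would be replaced by `DurableWorkflowEngine` and `ProposalWorkflow`.

```python
import asyncio
import time
from typing import Any, List, Tuple


async def timed_resume(
    engine: Any, workflow_cls: type, budget_seconds: float = 30.0
) -> Tuple[List[Any], float, bool]:
    """Resume pending workflows and report whether the call met its time budget."""
    started = time.monotonic()
    workflows = await engine.resume_all_pending(workflow_cls)
    elapsed = time.monotonic() - started
    return workflows, elapsed, elapsed <= budget_seconds


class StubWorkflow:
    """Hypothetical stand-in for ProposalWorkflow."""


class StubEngine:
    """Hypothetical stand-in for DurableWorkflowEngine, for illustration only."""

    async def resume_all_pending(self, workflow_cls: type) -> List[Any]:
        await asyncio.sleep(0)  # stands in for the database round trip
        return [workflow_cls(), workflow_cls()]


if __name__ == "__main__":
    workflows, elapsed, within_budget = asyncio.run(
        timed_resume(StubEngine(), StubWorkflow)
    )
    print(f"resumed {len(workflows)} workflows in {elapsed:.3f}s; "
          f"within budget: {within_budget}")
```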
4. Integrity Guarantees
Hash Chain Verification
Every state transition is recorded in a SHA-256 hash chain, in which each record's hash also covers the hash of the record before it.
Tamper detection: If any transition record is modified, verify_audit_chain() will detect the broken chain and report the specific transition.
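To make the idea concrete, here is a minimal, self-contained sketch of a SHA-256 hash chain with tamper detection. It is not the code behind `WorkflowTransition.compute_hash()` or `verify_audit_chain()`; the record layout, the `GENESIS` placeholder, and all function names are assumptions chosen for illustration.

```python
import hashlib
import json
from typing import List, Optional, Tuple

GENESIS = "0" * 64  # assumed placeholder meaning "no previous record"


def compute_hash(previous_hash: str, payload: dict) -> str:
    """Hash the previous record's hash together with this record's payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((previous_hash + body).encode("utf-8")).hexdigest()


def append_transition(chain: List[dict], payload: dict) -> None:
    """Append a record whose hash links back to the current tail of the chain."""
    previous_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "payload": payload,
        "previous_hash": previous_hash,
        "hash": compute_hash(previous_hash, payload),
    })


def verify_chain(chain: List[dict]) -> Tuple[bool, Optional[int]]:
    """Return (True, None) if intact, else (False, index of the first bad record)."""
    previous_hash = GENESIS
    for index, record in enumerate(chain):
        if record["previous_hash"] != previous_hash:
            return False, index
        if record["hash"] != compute_hash(previous_hash, record["payload"]):
            return False, index
        previous_hash = record["hash"]
    return True, None


chain: List[dict] = []
append_transition(chain, {"from": "draft", "to": "review"})
append_transition(chain, {"from": "review", "to": "approved"})
print(verify_chain(chain))  # (True, None)

chain[0]["payload"]["to"] = "rejected"  # simulate tampering with the first record
print(verify_chain(chain))  # (False, 0)
```

Because every hash depends on its predecessor, editing one record invalidates that record and forces an attacker to recompute every later hash, which is what lets verification point at the specific transition.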
Checkpoint Integrity
Each checkpoint stores a SHA-256 hash of the serialized state snapshot. On restore, the hash can be recomputed and compared.
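The recompute-and-compare step might look like the sketch below. It assumes the state is JSON-serializable and is serialized with sorted keys so that the hash is stable; the actual serialization used in `src/workflows/persistence/models.py` is not described here, and the function names are hypothetical.

```python
import hashlib
import json


def snapshot_hash(state: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the state."""
    serialized = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


def make_checkpoint(state: dict) -> dict:
    """Store the state alongside the hash computed at save time."""
    return {"state": state, "sha256": snapshot_hash(state)}


def restore_checkpoint(checkpoint: dict) -> dict:
    """Recompute the hash on restore and refuse to return a mismatching snapshot."""
    if snapshot_hash(checkpoint["state"]) != checkpoint["sha256"]:
        raise ValueError("checkpoint hash mismatch; snapshot may be corrupted")
    return checkpoint["state"]
```

Sorting keys matters: two dictionaries with the same contents but different insertion order would otherwise serialize differently and fail the comparison.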
Verification Commands
# Programmatic verification
valid, error = await engine.verify_integrity(workflow_id)
assert valid, f"Chain broken: {error}"
5. Tested Failure Scenarios
The following scenarios are covered by integration tests (tests/integration/test_dr_recovery.py):
| Test | Scenario | Verified |
|---|---|---|
| Checkpoint-Resume | Create, checkpoint, discard, resume | State matches |
| Post-Transition Resume | Transition, checkpoint, resume | Correct state restored |
| Hash Chain Valid | Multiple transitions | Chain verification passes |
| Batch Recovery | 5 workflows, resume all | All recovered |
| Completed Exclusion | Completed workflows | Not in pending list |
| Mixed States | Workflows at different stages | All resume correctly |
| Time-Travel | Load historical checkpoint | Correct historical state |
| Export/Import | to_dict() / from_dict() round-trip | State preserved |
| Audit Trail | Multiple transitions | All recorded with hashes |
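For readers unfamiliar with the Export/Import pattern, a round-trip test can be sketched as follows. `ToyWorkflow` is a hypothetical stand-in, not the AEGIS workflow class, and this is not the code in tests/integration/test_dr_recovery.py; only the `to_dict()` / `from_dict()` method names come from the table.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ToyWorkflow:
    """Hypothetical stand-in for a workflow class, for illustration only."""

    workflow_id: str
    state: str = "draft"
    history: List[str] = field(default_factory=list)

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "ToyWorkflow":
        return cls(**data)


def test_export_import_round_trip() -> None:
    original = ToyWorkflow("wf-1", state="review", history=["draft"])
    restored = ToyWorkflow.from_dict(original.to_dict())
    # Equal field by field, but a distinct object with its own history list.
    assert restored == original
    assert restored is not original
    assert restored.history is not original.history
```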
6. Current Limitations
| Limitation | Impact | Mitigation Path |
|---|---|---|
| Single-instance only | No automatic failover | Add async replication (Phase 2) |
| No real-time replication | Potential data loss if DB fails | Use PostgreSQL with streaming replication |
| No multi-region | Single region dependency | Deploy read replicas in secondary region |
| No point-in-time recovery | Cannot restore to arbitrary timestamp | Use PostgreSQL PITR with WAL archiving |
| No automated backup | Manual backup required | Add scheduled pg_dump or WAL archiving |
7. Phase 2 Upgrade Path
When deployment infrastructure is available:
- PostgreSQL streaming replication for hot standby
- WAL archiving for point-in-time recovery
- Scheduled backups via pg_dump or cloud provider snapshots
- Multi-region read replicas for geographic redundancy
- Automated failover via Patroni or cloud provider HA
- Live DR drill simulating region outage with documented runbook
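As a starting point for the scheduled-backup item, the sketch below builds and runs a `pg_dump` command that writes a timestamped custom-format archive. The database name, backup directory, and function names are hypothetical; connection settings are assumed to come from the usual libpq environment variables (such as PGHOST and PGUSER), and scheduling is left to cron or a similar tool.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import List


def build_pg_dump_command(dbname: str, backup_dir: Path, now: datetime) -> List[str]:
    """Build a pg_dump invocation that writes a timestamped custom-format archive."""
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = backup_dir / f"{dbname}-{stamp}.dump"
    return ["pg_dump", "--format=custom", f"--file={target}", f"--dbname={dbname}"]


def run_backup(dbname: str, backup_dir: Path) -> None:
    """Run pg_dump once; raises CalledProcessError if pg_dump exits non-zero."""
    command = build_pg_dump_command(dbname, backup_dir, datetime.now(timezone.utc))
    subprocess.run(command, check=True)
```

Separating command construction from execution keeps the naming logic testable without a running PostgreSQL server. A dump of this kind restores only to the moment it was taken, so it complements WAL archiving rather than replacing it.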
8. Related Documents
- Gap Analysis — Issue #8 tracking
- Unified AEGIS Specification — Section 6 (Data Privacy & RBAC)
- Interface Contract — RPO/RTO parameters