ANSAI Self-Healing System Architecture
🎯 Overview
The self-healing system uses systemd's OnFailure= directive to launch a healing script whenever a monitored service enters the failed state. Recovery is fully automated, with optional AI-assisted root cause analysis.
Architecture Diagram
┌──────────────────────────────────────────────────────────────────┐
│                           YOUR SERVICE                           │
│                      (my-flask-app.service)                      │
│                                                                  │
│  Status: Running → Crashes (OOM, exception, etc.)               │
└───────────────────────────┬──────────────────────────────────────┘
                            │
                            │ Service FAILS
                            │
                            ▼
┌──────────────────────────────────────────────────────────────────┐
│                    SYSTEMD OnFailure Trigger                     │
│                                                                  │
│  /etc/systemd/system/my-flask-app.service.d/self-heal.conf       │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ [Unit]                                                     │  │
│  │ OnFailure=my-flask-app-self-heal.service                   │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
│  → Immediately launches healing service when main service fails │
└───────────────────────────┬──────────────────────────────────────┘
                            │
                            │ Triggers (instant)
                            │
                            ▼
┌──────────────────────────────────────────────────────────────────┐
│                       SELF-HEALING SERVICE                       │
│                 (my-flask-app-self-heal.service)                 │
│                                                                  │
│  Type: oneshot (runs once, then exits)                          │
│  ExecStart: /usr/local/bin/self-heal/my-flask-app-self-heal.sh   │
│                                                                  │
│  Protection: StartLimitBurst=5 (max 5 runs in 5 minutes)        │
└───────────────────────────┬──────────────────────────────────────┘
                            │
                            │ Executes bash script
                            │
                            ▼
┌──────────────────────────────────────────────────────────────────┐
│                     HEALING SCRIPT WORKFLOW                      │
│                     (universal-self-heal.sh)                     │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ 1. DETECTION PHASE                                         │  │
│  │    ├─ Check if service is actually down                    │  │
│  │    ├─ Gather diagnostics (logs, status, resources)         │  │
│  │    └─ Initialize report file                               │  │
│  └────────────────────────────────────────────────────────────┘  │
│                              │                                   │
│                              ▼                                   │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ 2. AI ANALYSIS (Optional, runs first)                      │  │
│  │    ├─ Collect: service status, logs, system resources      │  │
│  │    ├─ Send to AI (Groq/Ollama/OpenAI/LiteLLM)              │  │
│  │    └─ Get: Root cause + recommended fix                    │  │
│  │                                                            │  │
│  │    Example output:                                         │  │
│  │    • ROOT CAUSE: OOM killer terminated process             │  │
│  │    • WHY: Memory limit too low, memory leak in app         │  │
│  │    • FIX: Increase MemoryMax in systemd unit               │  │
│  │    • PREVENTION: Add memory monitoring                     │  │
│  └────────────────────────────────────────────────────────────┘  │
│                              │                                   │
│                              ▼                                   │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ 3. HEALING STRATEGIES (tries each in order)                │  │
│  │                                                            │  │
│  │ Strategy 1: SERVICE RESTART                                │  │
│  │    ├─ systemctl restart service                            │  │
│  │    ├─ Wait 5 seconds                                       │  │
│  │    └─ Check if active → ✅ SUCCESS → Send alert → EXIT     │  │
│  │    │ Failed                                                │  │
│  │    ▼                                                       │  │
│  │ Strategy 2: PORT CONFLICT                                  │  │
│  │    ├─ Check if port is occupied (lsof)                     │  │
│  │    ├─ If stale process: kill -9 PID                        │  │
│  │    ├─ Restart service                                      │  │
│  │    └─ Check if active → ✅ SUCCESS → Send alert → EXIT     │  │
│  │    │ Failed                                                │  │
│  │    ▼                                                       │  │
│  │ Strategy 3: CONFIG VALIDATION                              │  │
│  │    ├─ Run service-specific validation (e.g., traefik)      │  │
│  │    ├─ Check config syntax                                  │  │
│  │    └─ If valid: restart → ✅ SUCCESS → Send alert → EXIT   │  │
│  │    │ Failed                                                │  │
│  │    ▼                                                       │  │
│  └────────────────────────────────────────────────────────────┘  │
│                              │                                   │
│                              ▼                                   │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ 4. ALERT DISPATCH                                          │  │
│  │                                                            │  │
│  │ Success (service recovered):                               │  │
│  │    ├─ Email: ✅ "Service Restarted - Auto-Healed"          │  │
│  │    ├─ Webhook: Slack/Discord notification                  │  │
│  │    ├─ Journal: systemd-cat to journald                     │  │
│  │    └─ Details: What failed, how fixed, healing time        │  │
│  │                                                            │  │
│  │ Failure (all strategies failed):                           │  │
│  │    ├─ Email: ❌ "URGENT: Manual Intervention Required"     │  │
│  │    ├─ Webhook: Critical alert                              │  │
│  │    ├─ Journal: systemd-cat to journald                     │  │
│  │    └─ Details: What was tried, next steps, SSH commands   │  │
│  └────────────────────────────────────────────────────────────┘  │
│                              │                                   │
│                              ▼                                   │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ 5. LOGGING & TRACKING                                      │  │
│  │    ├─ /var/log/self-heal-{service}.log (persistent)        │  │
│  │    ├─ journalctl -t ansai-self-heal (systemd journal)      │  │
│  │    └─ /var/run/{service}-heal-status (state file)          │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
└───────────────────────────┬──────────────────────────────────────┘
                            │
                            │ Exit (success or failure)
                            │
                            ▼
┌──────────────────────────────────────────────────────────────────┐
│                           YOUR SERVICE                           │
│  Status: Running (healed) ✅                                     │
│      or: Still Down ❌                                           │
└──────────────────────────────────────────────────────────────────┘
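The "tries each in order" behaviour of phase 3 can be sketched as a small loop. This is a minimal illustration, not the actual ANSAI script: the three strategy functions are hypothetical stubs with hard-coded results, and `run_healing_strategies` is a name chosen for this example.

```shell
#!/usr/bin/env bash
# Sketch of the "try each strategy in order, stop at the first success" loop.
# The strategy functions below are hypothetical stubs, not the real ANSAI code.

# Each strategy returns 0 if the service is healthy afterwards, non-zero otherwise.
strategy_service_restart()   { return 1; }  # stub: pretend the plain restart did not help
strategy_port_conflict()     { return 0; }  # stub: pretend freeing the port worked
strategy_config_validation() { return 1; }  # stub: not reached in this example

# Run the given strategies in order.
# Prints the name of the first strategy that succeeds, or "none" if all fail.
run_healing_strategies() {
    local strategy
    for strategy in "$@"; do
        if "$strategy"; then
            echo "$strategy"
            return 0
        fi
    done
    echo "none"
    return 1
}

result=$(run_healing_strategies \
    strategy_service_restart \
    strategy_port_conflict \
    strategy_config_validation) || true
echo "healed by: $result"
```

Returning a non-zero status when every strategy fails lets the caller choose between the success alert and the "manual intervention required" alert.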
Typical Healing Flow (Timeline)
T+0s : Flask app crashes (OOM killer, exception, etc.)
T+0.1s : systemd detects failure
T+0.2s : systemd triggers OnFailure → launches my-flask-app-self-heal.service
T+0.3s : Healing script starts
T+0.5s : Gather diagnostics (logs, status, system info)
T+1s : AI analysis begins (if enabled)
T+3s : AI returns root cause analysis
T+3.5s : Strategy 1: Attempt service restart
T+3.6s : Execute: systemctl restart my-flask-app.service
T+8.6s : Wait 5 seconds for service to stabilize
T+8.7s : Check: systemctl is-active → ✅ SUCCESS!
T+9s : Generate detailed report
T+10s : Send email notification
T+10.5s : Log to journald
T+11s : Healing script exits
T+11s : Your service is RUNNING again
Total downtime: under ~11 seconds (the service is verified active again at T+8.7s; the remaining time is spent on reporting and alerts)
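The restart, wait, and check steps in the timeline (T+3.6s to T+8.7s) could look roughly like the function below. It is a sketch under assumptions: `restart_and_verify`, `SYSTEMCTL`, and `SETTLE_SECONDS` are names invented for this example, and the two variables exist only so the function can be exercised without touching real units.

```shell
#!/usr/bin/env bash
# Sketch of Strategy 1: restart, wait for the service to settle, verify.
# SYSTEMCTL and SETTLE_SECONDS are hypothetical knobs for this example.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"
SETTLE_SECONDS="${SETTLE_SECONDS:-5}"

# restart_and_verify UNIT
# Returns 0 only if the restart command succeeded and the unit is active afterwards.
restart_and_verify() {
    local unit="$1"
    "$SYSTEMCTL" restart "$unit" || return 1
    sleep "$SETTLE_SECONDS"
    # "is-active --quiet" prints nothing and exits 0 only when the unit is active.
    "$SYSTEMCTL" is-active --quiet "$unit"
}

# Example call on a real host (needs root):
#   restart_and_verify my-flask-app.service && echo "recovered"
```

The fixed wait mirrors the 5-second pause in the timeline; a polling loop would shorten recovery for services that start quickly, at the cost of a little more code.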
🧠 AI Analysis Example
The system sends this context to the AI:
SERVICE FAILURE REPORT:
Service Name: my-flask-app
Port: 5000
SERVICE STATUS:
● my-flask-app.service - My Flask Application
Loaded: loaded
Active: failed (Result: signal)
Main PID: 12345 (code=killed, signal=KILL)
RECENT LOGS:
Dec 05 10:30:45 testserver my-flask-app[12345]: MemoryError: out of memory
Dec 05 10:30:46 testserver kernel: Out of memory: Killed process 12345
SYSTEM RESOURCES:
Memory: 7.5G/8G (94%)
Disk: 45G/100G (45%)
AI Response (in ~2 seconds):
ROOT CAUSE: OOM (Out of Memory) killer terminated the process
WHY IT FAILED:
• Application exceeded available memory (94% usage)
• No memory limit set in systemd unit
• Likely memory leak or inefficient caching
RECOMMENDED FIX:
1. Immediate: systemctl restart my-flask-app.service
2. Short-term: Add to unit file:
[Service]
MemoryMax=2G
3. Long-term: Profile app with memory_profiler
PREVENTION: Monitor memory usage, add alerts at 80% threshold
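The context block shown above can be assembled from a few strings. The sketch below assumes a helper named `build_failure_report` (hypothetical, not taken from the ANSAI script) and uses sample inputs copied from the example; the commands in the comment are one plausible way to gather the real values.

```shell
#!/usr/bin/env bash
# Sketch: assemble the failure report that is sent to the AI backend.
# build_failure_report is a hypothetical helper name used for illustration.

# build_failure_report NAME PORT STATUS_TEXT LOG_TEXT RESOURCE_TEXT
build_failure_report() {
    local name="$1" port="$2" status="$3" logs="$4" resources="$5"
    printf 'SERVICE FAILURE REPORT:\n'
    printf 'Service Name: %s\n' "$name"
    printf 'Port: %s\n' "$port"
    printf 'SERVICE STATUS:\n%s\n' "$status"
    printf 'RECENT LOGS:\n%s\n' "$logs"
    printf 'SYSTEM RESOURCES:\n%s\n' "$resources"
}

# On a real host the inputs might be gathered like this (assumed commands):
#   status=$(systemctl status my-flask-app.service --no-pager 2>&1)
#   logs=$(journalctl -u my-flask-app.service -n 50 --no-pager 2>&1)
#   resources=$(free -h; df -h /)
report=$(build_failure_report "my-flask-app" 5000 \
    "Active: failed (Result: signal)" \
    "MemoryError: out of memory" \
    "Memory: 7.5G/8G (94%)")
echo "$report"
```

Keeping the report as plain text makes it easy to reuse the same string in the email body and in the AI prompt.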
Key Components
1. Systemd Override Files
/etc/systemd/system/my-flask-app.service.d/
└── self-heal.conf
[Unit]
OnFailure=my-flask-app-self-heal.service
2. Healing Service Unit
/etc/systemd/system/my-flask-app-self-heal.service
Type=oneshot
ExecStart=/usr/local/bin/self-heal/my-flask-app-self-heal.sh
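For orientation, the two unit files above could be generated as follows. This is a sketch, not the role's actual template: `write_heal_units` is a hypothetical helper, the `Description=` line is invented, and placing `StartLimitIntervalSec=300` with `StartLimitBurst=5` in `[Unit]` is an assumption about how the "max 5 runs in 5 minutes" limit is expressed. The demo writes into a temporary directory rather than `/etc/systemd/system`.

```shell
#!/usr/bin/env bash
# Sketch: generate the OnFailure drop-in and the oneshot healing unit.
# write_heal_units is a hypothetical helper; the rate-limit values are assumed.

# write_heal_units SERVICE OUTDIR
write_heal_units() {
    local service="$1" outdir="$2"
    mkdir -p "$outdir/$service.service.d"

    # Drop-in that hooks the healing unit onto the monitored service.
    cat > "$outdir/$service.service.d/self-heal.conf" <<EOF
[Unit]
OnFailure=$service-self-heal.service
EOF

    # Oneshot unit that runs the healing script, at most 5 starts per 300 s.
    cat > "$outdir/$service-self-heal.service" <<EOF
[Unit]
Description=Self-healing for $service
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Type=oneshot
ExecStart=/usr/local/bin/self-heal/$service-self-heal.sh
EOF
}

demo_dir=$(mktemp -d)
write_heal_units my-flask-app "$demo_dir"
cat "$demo_dir/my-flask-app-self-heal.service"
```

On a real host the files would go under `/etc/systemd/system`, followed by `systemctl daemon-reload` so systemd picks them up.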
3. Healing Script
/usr/local/bin/self-heal/my-flask-app-self-heal.sh
- 700+ lines of bash
- Multiple healing strategies
- AI integration
- Email/webhook alerts
4. Status Dashboard
🛡️ Protection Mechanisms
Rate Limiting
- Max 5 healing attempts in 5 minutes
- Prevents infinite restart loops
- If exceeded: service marked as failed, requires manual intervention
Smart Detection
- Only heals if service is actually down
- Distinguishes between manual stops vs crashes
- Won't interfere with intentional maintenance
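One way to tell a crash from an intentional stop is to look at the unit's `ActiveState` and `Result` properties. The sketch below is an assumption about how such a check might be written; `classify_unit_state` is a hypothetical name, and the mapping is deliberately simplified.

```shell
#!/usr/bin/env bash
# Sketch: classify a unit from its ActiveState and Result properties.
# classify_unit_state is a hypothetical helper; the mapping is simplified.

# classify_unit_state ACTIVE_STATE RESULT
# Prints one of: running, stopped, crashed (<result>), unknown
classify_unit_state() {
    local active_state="$1" result="$2"
    case "$active_state" in
        active|activating|reloading) echo "running" ;;
        inactive)                    echo "stopped" ;;           # clean stop, e.g. systemctl stop
        failed)                      echo "crashed ($result)" ;; # e.g. Result=signal or exit-code
        *)                           echo "unknown" ;;
    esac
}

# On a real host the inputs could be read with:
#   state=$(systemctl show -p ActiveState --value my-flask-app.service)
#   result=$(systemctl show -p Result --value my-flask-app.service)
classify_unit_state failed signal
classify_unit_state inactive success
```

Only the "crashed" case would proceed to the healing strategies; "stopped" is treated as maintenance and left alone.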
Escalation Path
If all strategies fail:
1. Detailed email with failure analysis
2. AI-suggested next steps
3. Exact SSH commands to run
4. Links to relevant logs
📧 Alert Examples
Success Email
✅ ANSAI: my-flask-app - RESOLVED
Service: my-flask-app
Domain: app.example.com
Port: 5000
AUTOMATIC ISSUE RESOLUTION
ISSUE DETECTED: my-flask-app is not running
DIAGNOSIS: Service crashed with exit code 137 (OOM)
AI ROOT CAUSE ANALYSIS:
[AI analysis here]
HEALING STRATEGY: Standard Service Restart
✅ SUCCESS: Service restarted and is now active
HOW IT WAS FIXED:
1. Detected my-flask-app was inactive
2. Executed: systemctl restart my-flask-app.service
3. Waited 5 seconds
4. Verified service is active
5. Service listening on port 5000
HEALING TIME: ~5 seconds
CONFIDENCE: High
View logs: journalctl -u my-flask-app -n 100
Failure Email
❌ ANSAI: my-flask-app - FAILED
ALL HEALING STRATEGIES FAILED
Service: my-flask-app
Status: STILL DOWN after healing attempts
MANUAL INTERVENTION REQUIRED:
1. SSH to server:
ssh [email protected]
2. Check service status:
sudo systemctl status my-flask-app.service
3. View recent logs:
sudo journalctl -u my-flask-app -n 100
4. Try manual restart:
sudo systemctl restart my-flask-app.service
⚠️ CRITICAL SERVICE - IMMEDIATE ATTENTION REQUIRED
Configuration
All settings in: orchestrators/ansible/roles/universal_self_heal/defaults/main.yml
Services to Monitor
monitored_services:
  - name: my-flask-app
    port: 5000
    domain: app.example.com
    critical: true
    healing_strategies:
      - service_restart
      - database_check
      - port_conflict
      - environment_check
AI Backend Options
ai_backend: groq # groq, ollama, litellm, openai
# Groq (default - fast & free)
groq_api_key: "{{ lookup('env', 'ANSAI_GROQ_API_KEY') }}"
groq_model: llama-3.1-8b-instant
# Ollama (local - no API keys)
ollama_url: http://localhost:11434
ollama_model: llama3
Alert Methods
alert_method: email # email, webhook, both, none
# Email via SMTP
smtp_server: smtp.gmail.com
smtp_port: 587
smtp_user: [email protected]
# Webhook (Slack/Discord)
webhook_url: "https://hooks.slack.com/..."
webhook_format: slack # slack, discord, generic
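The `webhook_format` setting implies a different JSON body per target. As a rough sketch: Slack incoming webhooks accept a `text` field and Discord webhooks accept a `content` field; the `message` key used for `generic` is an assumption, and both helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: build the webhook JSON body for each webhook_format value.
# json_escape and build_webhook_payload are hypothetical helpers;
# the "message" key for the generic format is an assumption.

# Escape backslashes, double quotes and newlines for use inside a JSON string.
json_escape() {
    local s="$1"
    s=${s//\\/\\\\}
    s=${s//\"/\\\"}
    s=${s//$'\n'/\\n}
    printf '%s' "$s"
}

# build_webhook_payload FORMAT MESSAGE
build_webhook_payload() {
    local format="$1" message
    message=$(json_escape "$2")
    case "$format" in
        slack)   printf '{"text":"%s"}\n' "$message" ;;
        discord) printf '{"content":"%s"}\n' "$message" ;;
        *)       printf '{"message":"%s"}\n' "$message" ;;
    esac
}

payload=$(build_webhook_payload slack 'ANSAI: my-flask-app - "RESOLVED"')
echo "$payload"

# Sending it would then be a single POST (webhook_url comes from the config):
#   curl -fsS -X POST -H 'Content-Type: application/json' -d "$payload" "$webhook_url"
```

The escaping here covers only the characters most likely to appear in an alert line; a tool such as `jq` would be a safer choice for arbitrary log text.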
Benefits
Automation
✅ Zero human intervention for 90%+ of failures
✅ Average recovery time: 5-15 seconds
✅ Works 24/7, even while you sleep
Intelligence
✅ AI-powered root cause analysis
✅ Multiple healing strategies
✅ Learns from failure patterns
Observability
✅ Detailed email reports with exact steps taken
✅ Systemd journal integration
✅ Persistent logs for forensics
Reliability
✅ Rate limiting prevents restart storms
✅ Distinguishes crash vs manual stop
✅ Escalates to human when needed
🎯 Philosophy
This system embodies the ANSAI philosophy:
- Automate ruthlessly - Systems beat willpower
- Observe everything - Can't fix what you can't see
- Config as Code - Infrastructure is code, version controlled
- Self-healing - Systems should recover without human intervention
It's the DevOps equivalent of "pay yourself first" - automate the critical stuff so you can focus on what matters.
Commands
# View all healing logs
journalctl -t ansai-self-heal
# View specific service logs
journalctl -u my-flask-app-self-heal
# Check status of all monitored services
testserver-status
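# Test the healing system by simulating a crash (will auto-recover).
# A plain "systemctl stop" is an intentional stop and does not trigger OnFailure.
sudo systemctl kill --signal=SIGKILL my-flask-app
watch systemctl status my-flask-app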
# View healing status file
cat /var/run/my-flask-app-heal-status
# View persistent log
tail -f /var/log/self-heal-my-flask-app.log
Bottom Line: Your services become self-healing organisms. They detect failures, diagnose issues with AI, attempt multiple fixes, and report results - all in seconds, without human intervention.