ANSAI Self-Healing System Architecture¶

🎯 Overview¶

The self-healing system uses systemd's OnFailure directive to automatically trigger healing scripts when services crash. It's a fully automated, AI-powered recovery system.

🏗️ Architecture Diagram¶

┌─────────────────────────────────────────────────────────────────────┐
│                         YOUR SERVICE                                 │
│                    (my-flask-app.service)                            │
│                                                                       │
│  Status: Running → Crashes (OOM, exception, etc.)                   │
└───────────────────────────┬─────────────────────────────────────────┘
                            │
                            │ Service FAILS
                            │
                            ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    SYSTEMD OnFailure Trigger                         │
│                                                                       │
│  /etc/systemd/system/my-flask-app.service.d/self-heal.conf         │
│  ┌────────────────────────────────────────────────────────┐        │
│  │ [Unit]                                                  │        │
│  │ OnFailure=my-flask-app-self-heal.service               │        │
│  └────────────────────────────────────────────────────────┘        │
│                                                                       │
│  → Immediately launches healing service when main service fails     │
└───────────────────────────┬─────────────────────────────────────────┘
                            │
                            │ Triggers (instant)
                            │
                            ▼
┌─────────────────────────────────────────────────────────────────────┐
│              SELF-HEALING SERVICE                                    │
│         (my-flask-app-self-heal.service)                            │
│                                                                       │
│  Type: oneshot (runs once, then exits)                              │
│  ExecStart: /usr/local/bin/self-heal/my-flask-app-self-heal.sh     │
│                                                                       │
│  Protection: StartLimitBurst=5 (max 5 runs in 5 minutes)           │
└───────────────────────────┬─────────────────────────────────────────┘
                            │
                            │ Executes bash script
                            │
                            ▼
┌─────────────────────────────────────────────────────────────────────┐
│                 HEALING SCRIPT WORKFLOW                              │
│              (universal-self-heal.sh)                                │
│                                                                       │
│  ┌────────────────────────────────────────────────────────────┐    │
│  │ 1. DETECTION PHASE                                         │    │
│  │    ├─ Check if service is actually down                    │    │
│  │    ├─ Gather diagnostics (logs, status, resources)         │    │
│  │    └─ Initialize report file                               │    │
│  └────────────────────────────────────────────────────────────┘    │
│                            │                                         │
│                            ▼                                         │
│  ┌────────────────────────────────────────────────────────────┐    │
│  │ 2. AI ANALYSIS (Optional, runs first)                      │    │
│  │    ├─ Collect: service status, logs, system resources      │    │
│  │    ├─ Send to AI (Groq/Ollama/OpenAI/LiteLLM)             │    │
│  │    └─ Get: Root cause + recommended fix                    │    │
│  │                                                             │    │
│  │    Example output:                                          │    │
│  │    • ROOT CAUSE: OOM killer terminated process             │    │
│  │    • WHY: Memory limit too low, memory leak in app         │    │
│  │    • FIX: Increase MemoryMax in systemd unit               │    │
│  │    • PREVENTION: Add memory monitoring                     │    │
│  └────────────────────────────────────────────────────────────┘    │
│                            │                                         │
│                            ▼                                         │
│  ┌────────────────────────────────────────────────────────────┐    │
│  │ 3. HEALING STRATEGIES (tries each in order)                │    │
│  │                                                             │    │
│  │    Strategy 1: SERVICE RESTART                             │    │
│  │    ├─ systemctl restart service                            │    │
│  │    ├─ Wait 5 seconds                                       │    │
│  │    └─ Check if active → ✅ SUCCESS → Send alert → EXIT    │    │
│  │                          ↓                                  │    │
│  │                          ❌ Failed                          │    │
│  │                          ↓                                  │    │
│  │    Strategy 2: PORT CONFLICT                               │    │
│  │    ├─ Check if port is occupied (lsof)                     │    │
│  │    ├─ If stale process: kill -9 PID                        │    │
│  │    ├─ Restart service                                      │    │
│  │    └─ Check if active → ✅ SUCCESS → Send alert → EXIT    │    │
│  │                          ↓                                  │    │
│  │                          ❌ Failed                          │    │
│  │                          ↓                                  │    │
│  │    Strategy 3: CONFIG VALIDATION                           │    │
│  │    ├─ Run service-specific validation (e.g., traefik)      │    │
│  │    ├─ Check config syntax                                  │    │
│  │    └─ If valid: restart → ✅ SUCCESS → Send alert → EXIT  │    │
│  │                                ↓                            │    │
│  │                                ❌ Failed                    │    │
│  └────────────────────────────────────────────────────────────┘    │
│                            │                                         │
│                            ▼                                         │
│  ┌────────────────────────────────────────────────────────────┐    │
│  │ 4. ALERT DISPATCH                                          │    │
│  │                                                             │    │
│  │    Success (service recovered):                            │    │
│  │    ├─ Email: ✅ "Service Restarted - Auto-Healed"         │    │
│  │    ├─ Webhook: Slack/Discord notification                  │    │
│  │    ├─ Journal: systemd-cat to journald                     │    │
│  │    └─ Details: What failed, how fixed, healing time        │    │
│  │                                                             │    │
│  │    Failure (all strategies failed):                        │    │
│  │    ├─ Email: ❌ "URGENT: Manual Intervention Required"     │    │
│  │    ├─ Webhook: Critical alert                              │    │
│  │    ├─ Journal: systemd-cat to journald                     │    │
│  │    └─ Details: What was tried, next steps, SSH commands    │    │
│  └────────────────────────────────────────────────────────────┘    │
│                            │                                         │
│                            ▼                                         │
│  ┌────────────────────────────────────────────────────────────┐    │
│  │ 5. LOGGING & TRACKING                                      │    │
│  │    ├─ /var/log/self-heal-{service}.log (persistent)        │    │
│  │    ├─ journalctl -t ansai-self-heal (systemd journal)      │    │
│  │    └─ /var/run/{service}-heal-status (state file)          │    │
│  └────────────────────────────────────────────────────────────┘    │
│                                                                       │
└───────────────────────────┬─────────────────────────────────────────┘
                            │
                            │ Exit (success or failure)
                            │
                            ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    YOUR SERVICE                                      │
│               Status: Running (healed) ✅                            │
│                   or: Still Down ❌                                  │
└─────────────────────────────────────────────────────────────────────┘

🔄 Typical Healing Flow (Timeline)¶

T+0s    : Flask app crashes (OOM killer, exception, etc.)
T+0.1s  : systemd detects failure
T+0.2s  : systemd triggers OnFailure → launches my-flask-app-self-heal.service
T+0.3s  : Healing script starts
T+0.5s  : Gather diagnostics (logs, status, system info)
T+1s    : AI analysis begins (if enabled)
T+3s    : AI returns root cause analysis
T+3.5s  : Strategy 1: Attempt service restart
T+3.6s  : Execute: systemctl restart my-flask-app.service
T+8.6s  : Wait 5 seconds for service to stabilize
T+8.7s  : Check: systemctl is-active → ✅ SUCCESS!
T+9s    : Generate detailed report
T+10s   : Send email notification
T+10.5s : Log to journald
T+11s   : Healing script exits
T+11s   : Your service is RUNNING again

Total downtime: ~11 seconds (most users never notice)

🧠 AI Analysis Example¶

The system sends this context to the AI:

SERVICE FAILURE REPORT:
Service Name: my-flask-app
Port: 5000

SERVICE STATUS:
● my-flask-app.service - My Flask Application
   Loaded: loaded
   Active: failed (Result: signal)
   Main PID: 12345 (code=killed, signal=KILL)

RECENT LOGS:
Dec 05 10:30:45 testserver my-flask-app[12345]: MemoryError: out of memory
Dec 05 10:30:46 testserver kernel: Out of memory: Killed process 12345

SYSTEM RESOURCES:
Memory: 7.5G/8G (94%)
Disk: 45G/100G (45%)

AI Response (in ~2 seconds):

ROOT CAUSE: OOM (Out of Memory) killer terminated the process

WHY IT FAILED:
• Application exceeded available memory (94% usage)
• No memory limit set in systemd unit
• Likely memory leak or inefficient caching

RECOMMENDED FIX:
1. Immediate: systemctl restart my-flask-app.service
2. Short-term: Add to unit file:
   [Service]
   MemoryMax=2G
3. Long-term: Profile app with memory_profiler

PREVENTION: Monitor memory usage, add alerts at 80% threshold

📊 Key Components¶

1. Systemd Override Files¶

/etc/systemd/system/my-flask-app.service.d/
└── self-heal.conf
    [Unit]
    OnFailure=my-flask-app-self-heal.service

2. Healing Service Unit¶

/etc/systemd/system/my-flask-app-self-heal.service
    Type=oneshot
    ExecStart=/usr/local/bin/self-heal/my-flask-app-self-heal.sh

3. Healing Script¶

/usr/local/bin/self-heal/my-flask-app-self-heal.sh
    - 700+ lines of bash
    - Multiple healing strategies
    - AI integration
    - Email/webhook alerts

4. Status Dashboard¶

/usr/local/bin/testserver-status
    Shows real-time status of all monitored services

🛡️ Protection Mechanisms¶

Rate Limiting¶

StartLimitBurst=5
StartLimitIntervalSec=300

- Max 5 healing attempts in 5 minutes - Prevents infinite restart loops - If exceeded: service marked as failed, requires manual intervention

Smart Detection¶

Only heals if service is actually down
Distinguishes between manual stops vs crashes
Won't interfere with intentional maintenance

Escalation Path¶

If all strategies fail: 1. Detailed email with failure analysis 2. AI-suggested next steps 3. Exact SSH commands to run 4. Links to relevant logs

📧 Alert Examples¶

Success Email¶

✅ ANSAI: my-flask-app - RESOLVED

Service: my-flask-app
Domain: app.example.com
Port: 5000

AUTOMATIC ISSUE RESOLUTION

ISSUE DETECTED: my-flask-app is not running
DIAGNOSIS: Service crashed with exit code 137 (OOM)

AI ROOT CAUSE ANALYSIS:
[AI analysis here]

HEALING STRATEGY: Standard Service Restart
✅ SUCCESS: Service restarted and is now active

HOW IT WAS FIXED:
1. Detected my-flask-app was inactive
2. Executed: systemctl restart my-flask-app.service
3. Waited 5 seconds
4. Verified service is active
5. Service listening on port 5000

HEALING TIME: ~5 seconds
CONFIDENCE: High

View logs: journalctl -u my-flask-app -n 100

Failure Email¶

❌ ANSAI: my-flask-app - FAILED

ALL HEALING STRATEGIES FAILED

Service: my-flask-app
Status: STILL DOWN after healing attempts

MANUAL INTERVENTION REQUIRED:

1. SSH to server:
   ssh [email protected]

2. Check service status:
   sudo systemctl status my-flask-app.service

3. View recent logs:
   sudo journalctl -u my-flask-app -n 100

4. Try manual restart:
   sudo systemctl restart my-flask-app.service

⚠️ CRITICAL SERVICE - IMMEDIATE ATTENTION REQUIRED

🎛️ Configuration¶

All settings in: orchestrators/ansible/roles/universal_self_heal/defaults/main.yml

Services to Monitor¶

monitored_services:
  - name: my-flask-app
    port: 5000
    domain: app.example.com
    critical: true
    healing_strategies:
      - service_restart
      - database_check
      - port_conflict
      - environment_check

AI Backend Options¶

ai_backend: groq  # groq, ollama, litellm, openai

# Groq (default - fast & free)
groq_api_key: "{{ lookup('env', 'ANSAI_GROQ_API_KEY') }}"
groq_model: llama-3.1-8b-instant

# Ollama (local - no API keys)
ollama_url: http://localhost:11434
ollama_model: llama3

Alert Methods¶

alert_method: email  # email, webhook, both, none

# Email via SMTP
smtp_server: smtp.gmail.com
smtp_port: 587
smtp_user: [email protected]

# Webhook (Slack/Discord)
webhook_url: "https://hooks.slack.com/..."
webhook_format: slack  # slack, discord, generic

🚀 Benefits¶

Automation¶

✅ Zero human intervention for 90%+ of failures ✅ Average recovery time: 5-15 seconds ✅ Works 24/7, even while you sleep

Intelligence¶

✅ AI-powered root cause analysis ✅ Multiple healing strategies ✅ Learns from failure patterns

Observability¶

✅ Detailed email reports with exact steps taken ✅ Systemd journal integration ✅ Persistent logs for forensics

Reliability¶

✅ Rate limiting prevents restart storms ✅ Distinguishes crash vs manual stop ✅ Escalates to human when needed

🎯 Philosophy¶

This system embodies the ANSAI philosophy: - Automate ruthlessly - Systems beat willpower - Observe everything - Can't fix what you can't see - Config as Code - Infrastructure is code, version controlled - Self-healing - Systems should recover without human intervention

It's the DevOps equivalent of "pay yourself first" - automate the critical stuff so you can focus on what matters.

📝 Commands¶

# View all healing logs
journalctl -t ansai-self-heal

# View specific service logs
journalctl -u my-flask-app-self-heal

# Check status of all monitored services
testserver-status

# Test the healing system (will auto-recover)
sudo systemctl stop my-flask-app
watch systemctl status my-flask-app

# View healing status file
cat /var/run/my-flask-app-heal-status

# View persistent log
tail -f /var/log/self-heal-my-flask-app.log

Bottom Line: Your services become self-healing organisms. They detect failures, diagnose issues with AI, attempt multiple fixes, and report results - all in seconds, without human intervention. 🤖