# EDA_Demo_For_SubHealthStateDetection_And_Self-Healing

**Repository Path**: fengyuancheung/eda_-demo_-for_-sub-health-state-detection_-and_-self-healing

## Basic Information

- **Project Name**: EDA_Demo_For_SubHealthStateDetection_And_Self-Healing
- **Description**: EDA_Demo_For_Sub-Health State Detection & Self-Healing in IT Systems
- **Primary Language**: YAML
- **License**: LGPL-3.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-04-03
- **Last Updated**: 2025-04-03

## Categories & Tags

**Categories**: Uncategorized

**Tags**: EDA, fault self-healing, sub-health detection, demo

## README

# EDA Demo: System Sub-Health Detection & Self-Healing Automation

#### Introduction

EDA_Demo_For_Sub-Health State Detection & Self-Healing in IT Systems

## 1. Core Objectives of This EDA Demo

**Real-time system monitoring with automated response:** Collect operational metrics continuously, identify sub-health states using predefined rules, and trigger automated remediation workflows.

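As a rough illustration of the "predefined rules" idea, the Python sketch below checks one metrics sample against threshold rules and returns the remediation actions to trigger. It is not the repository's rulebook: in the demo, detection is handled by Prometheus/Alertmanager and the response by EDA. The `Rule` class, the metric keys, and the action labels are hypothetical names chosen for this example. The 90% CPU and 80% filesystem thresholds are the ones used in Scenarios 1 and 2.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Rule:
    """One predefined sub-health rule (hypothetical structure)."""
    name: str         # alert name
    metric: str       # key to look up in the metrics sample
    threshold: float  # percent; the rule fires when value >= threshold
    action: str       # label of the remediation workflow to trigger


# Thresholds from the demo scenarios: CPU 90%+, filesystem usage 80%+.
# Metric keys and action labels are assumptions made for illustration.
RULES: Sequence[Rule] = (
    Rule("HighCPU", "cpu_percent", 90.0, "collect_perf_and_sosreport"),
    Rule("HighFilesystemUsage", "fs_percent", 80.0, "notify_admin_console"),
)


def evaluate(sample: Dict[str, float],
             rules: Sequence[Rule] = RULES) -> List[str]:
    """Return the actions of every rule that fires for one metrics sample.

    A metric missing from the sample is treated as 0.0, so it never fires.
    """
    return [rule.action for rule in rules
            if sample.get(rule.metric, 0.0) >= rule.threshold]
```

For example, `evaluate({"cpu_percent": 95.0, "fs_percent": 40.0})` returns only the CPU-related action, because the filesystem value is below its threshold.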
### 1.1 Sub-Health State Detection

- **Technical Implementation**:
  - Detect system anomalies via real-time monitoring and data pattern analysis
  - Identify emerging issues, including:
    - Performance degradation (e.g., SLA deviations)
    - Abnormal resource utilization (CPU/memory/disk I/O)
    - Latency spikes in critical services

### 1.2 Self-Healing Mechanism

- **Key Features**:
  - Automated remediation for common failure scenarios:

    ```text
    - Scenario 1: Use EDA + Prometheus + Alertmanager to collect perf data automatically for further troubleshooting (TS)
    - Scenario 2: Use EDA + Prometheus + Alertmanager to send a disk usage alert to the administrator's console
    - Scenario 3: Use EDA + webhook to notify the administrator when an unhealthy Linux state is detected
    - Scenario 4: Use EDA + webhook to automatically create firewall rules when "hacking" behavior is detected
    ```

  - Fully API-driven integration with enterprise operations platforms
  - Human confirmation required for critical systems (ISO 27001 compliance)

## 2. Solution Architecture

### Core Components and Version Compatibility

| Component       | Mandatory     | Version Requirements | Protocol Support |
|-----------------|---------------|----------------------|------------------|
| Prometheus      | ✔️            | ≥ v2.40              | HTTP/HTTPS/gRPC  |
| Alertmanager    | ❌ (optional) | v0.25+               | Webhook          |
| Node_exporter   | ❌ (optional) | ≥ 1.5                | TLS 1.2+         |
| Ansible AAP 2.5 | ✔️            | 2.5.x+               | REST API         |
| Grafana         | ❌ (optional) | ≥ 9.4 LTS            | OAuth 2.0        |

### Solution Architectural Blueprint

![Solution architectural blueprint](image1.png)

#### Software Architecture

![Software architecture](image2.png)

#### DEMO

- Scenario 1: Use EDA + Prometheus + Alertmanager to collect perf data automatically for further troubleshooting (TS)
  - Basic Principles:
    1. Prometheus/Alertmanager continuously monitor all target hosts.
    2. EDA runs a rulebook with a webhook that listens for the defined events from Alertmanager.
    3. When high CPU usage (90%+) is detected on any target host, Alertmanager reports the alert as "Firing".
    4. EDA receives the event and triggers an AAP job template that collects perf data and a sosreport for later root cause analysis.
    5. Use perf report and xsos to analyze the perf data and the sosreport.
  - DEMO Steps:

    ![Scenario 1 demo steps](image3.png)

- Scenario 2: Use EDA + Prometheus + Alertmanager to send a disk usage alert to the administrator's console
  - Basic Principles:
    1. Prometheus/Alertmanager continuously monitor all target hosts.
    2. EDA runs a rulebook with a webhook that listens for the defined events from Alertmanager.
    3. When high filesystem usage (80%+) is detected on any target host, Alertmanager reports the alert as "Firing".
    4. EDA receives the event and triggers an AAP workflow job template that:
       1. scans all target hosts and finds the hostname with the heavily used filesystem;
       2. sends an alert to the admin's console.
  - DEMO Steps:

    ![Scenario 2 demo steps](image4.png)

- Scenario 3: Use EDA + webhook to notify the administrator when an unhealthy Linux state is detected
  - Basic Principles:
    1. EDA runs a rulebook with a webhook that listens for the defined incoming events.
    2. A local script runs on the target host and monitors /var/log/messages. It triggers the webhook when it detects a defined keyword, such as "Call Trace", in the messages file.
    3. EDA receives the event from the webhook and calls a local playbook that notifies the admin to take action.
  - DEMO Steps:

    ![Scenario 3 demo steps](image33.png)

- Scenario 4: Use EDA + webhook to automatically create firewall rules when "hacking" behavior is detected
  - Basic Principles:
    1. EDA runs a rulebook with a webhook that listens for the defined events.
    2. A local script runs on the target hosts and monitors /var/log/secure. When it detects more than 6 failed SSH attempts from a single IP (the "hacking IP") in the latest /var/log/secure, it writes a log line containing a defined keyword to /var/log/messages.
    3. The webhook is triggered when the defined keyword appears in /var/log/messages.
    4. EDA detects the event from the webhook and automatically creates firewall rules to block the "hacking IP".
  - DEMO Steps:

    ![Scenario 4 demo steps](image.png)

#### Expected Result

- Scenario 1

  ![Scenario 1 expected result](image1212.png)

- Scenario 2

  ![Scenario 2 expected result](image2134.png)

- Scenario 3: N/A

- Scenario 4

  ![Scenario 4 expected result](image3456.png)

#### Contributors

1. Fengyuan Zhang / fzhang@redhat.com
2. Jerry Wang / Jewang@redhat.com