
AGH University of Science and Technology, Cracow
Ph.D. Thesis
Self-healing, highly available monitoring system for distributed environment
mgr inż. Piotr Pęgiel
Supervisor: prof. dr hab. inż. Jacek Kitowski
Auxiliary Supervisor: dr inż. Włodzimierz Funika
Faculty of Computer Science, Electronics and Telecommunications
Department of Computer Science
Cracow 2014


Title in Polish: System o wysokiej dostępności monitorujący środowiska rozproszone, posiadający zdolność samonaprawy (Akademia Górniczo-Hutnicza im. Stanisława Staszica w Krakowie, Rozprawa Doktorska, Kraków 2014).


I would like to dedicate this thesis to my wife, Asia.

Acknowledgements

I owe an enormous debt of gratitude to my Supervisor, prof. dr hab. inż. Jacek Kitowski, and my Auxiliary Supervisor, dr inż. Włodzimierz Funika, for their extremely helpful remarks, patience and assistance with the research described in this thesis. Special thanks go to my wife, my parents and my whole family, who believed in me and motivated me.

This work was realised partially in the framework of the EU Virtual Physiological Human: Sharing for Healthcare (VPH-Share) project under the Information Communication Technologies Programme (contract number 269978).

Contents

List of Figures
List of Listings

Chapter 1. Introduction
  1.1. Motivation
  1.2. High availability
  1.3. Self-healing
  1.4. Agent based systems
  1.5. Reliability
  1.6. Keywords and Definitions
  1.7. Objectives
  1.8. Research Contribution
  1.9. Organization of Thesis

Chapter 2. Technology background
  2.1. Discovery and Communication
    2.1.1. Service Location Protocol – SLP
    2.1.2. Simple Service Discovery Protocol (SSDP)
    2.1.3. JINI
    2.1.4. JXTA
    2.1.5. JGroups
  2.2. Election algorithms
    2.2.1. Bully Algorithm
    2.2.2. Ring Algorithm
    2.2.3. Santoro Rotem Algorithm
    2.2.4. Other algorithms
  2.3. Overview of existing monitoring systems
    2.3.1. Ganglia
    2.3.2. AutoPilot
    2.3.3. Gemini
    2.3.4. Aksum and JavaPSL
    2.3.5. OCM-G / G-PM
    2.3.6. J-OCM
    2.3.7. JXM
    2.3.8. SemMon
    2.3.9. Dynamic Monitoring Framework
  2.4. Self-healing and Self-adaptive Software
    2.4.1. PANACEA
    2.4.2. MUSIC
    2.4.3. GRAVA
  2.5. Summary

Chapter 3. Use Cases and Requirements
  3.1. Introduction
  3.2. Use cases
    3.2.1. User use cases
    3.2.2. Advanced User / Administrator use cases
    3.2.3. Application Programmer use cases
  3.3. System Requirements Definition
    3.3.1. Functional Requirements
    3.3.2. Non-functional Requirements
  3.4. Functional Requirements Specification
  3.5. Non-functional Requirements Specification
  3.6. Summary

Chapter 4. Architecture and Design
  4.1. Introduction
    4.1.1. Distributed system
    4.1.2. Multi-Agent based System (MAS)
    4.1.3. Self-healing, distributed monitoring system architecture
  4.2. AgeMon architecture
  4.3. Agent
    4.3.1. Agent Name and Collision Detection
    4.3.2. Monitoring Result, Monitoring Source and Measurable Capability
    4.3.3. Messaging
  4.4. Agent Communication Layer
  4.5. Discovery – finding agents
    4.5.1. Discovery based on the multicast
    4.5.2. Discovery based on the Gossip Server
    4.5.3. Tunneling & Discovery
    4.5.4. Mixed solution
  4.6. Roles
  4.7. GUI Role
    4.7.1. Monitoring Manager
    4.7.2. Visualization and Visualization Manager
    4.7.3. Rules Manager Component
  4.8. Monitoring Role
    4.8.1. Handling feedbacks – enabling the healing of the monitored system
  4.9. Persistence Role
  4.10. Rule Role
  4.11. Command Line Interface Role
  4.12. Summary

Chapter 5. System implementation
  5.1. Introduction
  5.2. Agent and Agent Group
  5.3. Role implementation
  5.4. Monitoring Role
  5.5. GUI Role
    5.5.1. AgentGraph
    5.5.2. Monitoring Component
    5.5.3. Rule Component
    5.5.4. Persistence Component
    5.5.5. Visualisation Component
    5.5.6. Summary
  5.6. Rule Role
  5.7. Persistence Role
  5.8. CLI Role
  5.9. System-wide architectural concepts
    5.9.1. Service Interface and Start Up/Shutting Down Procedure
    5.9.2. Foundation Services
    5.9.3. Dependency Injection
  5.10. Summary

Chapter 6. Communication in the AgeMon System
  6.1. Transport Layer
  6.2. Abstract Agent Layer
  6.3. Abstract Monitoring Layer
  6.4. Message Types
  6.5. Summary

Chapter 7. High Availability and Self-Healing
  7.1. High Availability and Self-Healing in Monitoring System
    7.1.1. Automatic Discovery
    7.1.2. Reliable Transport Protocols
    7.1.3. Network failures tolerance
    7.1.4. Absence of Single Point of Failure
    7.1.5. Roles Redundancy
    7.1.6. Substitute Agents – Failover
    7.1.7. Cooperative Mode – Rules
    7.1.8. Advanced Rules – Self-Healing
    7.1.9. High Availability/Self-Healing strategies – a summary
  7.2. Healing of the monitored system
    7.2.1. Transparent Healing
    7.2.2. Application Aware Healing
  7.3. Summary

Chapter 8. Testing and System Deployment
  8.1. Performance Tests
    8.1.1. Scenario 1
    8.1.2. Scenario 2
    8.1.3. Scenario 3
    8.1.4. Performance Summary
  8.2. Performance of the low-level communication layer
  8.3. Latency in the system
    8.3.1. Decision Time
    8.3.2. Persistence Time
    8.3.3. Network latencies
    8.3.4. Latency summary
  8.4. Self-Healing Tests / High Availability Tests
    8.4.1. Substitute Persistence Agent
    8.4.2. Restart Monitoring Agent
  8.5. Healing SUM
  8.6. Soak test
  8.7. Reliability
  8.8. Code quality
    8.8.1. Code statistics
    8.8.2. Violations
    8.8.3. Duplications
    8.8.4. Cyclomatic Complexity
    8.8.5. LCOM4
    8.8.6. Code Quality – Summary
  8.9. Summary

Chapter 9. Summary
  9.1. Thesis and Requirements verification
  9.2. Novel concepts introduced
  9.3. Future development
    9.3.1. Enlarge Agent Autonomy
    9.3.2. Reasoning from the historical results – Advanced data analysis
    9.3.3. Learning
    9.3.4. Advanced metrics / Transformations
    9.3.5. Rules and Action Enhancements
    9.3.6. Complex Event Processing
    9.3.7. Advanced Fault Detection in Monitoring System
    9.3.8. GUI Enhancements – support for different types of charts
    9.3.9. Non-relational databases

Index
Bibliography

List of Figures

1.1  State diagram of the self-healing system
1.2  Software faults (a) and failure model (b)

2.1  Interactions between SLP entities
2.2  SLP Protocol use cases
2.3  SSDP Protocol – discovery request details
2.4  SSDP Protocol – presence announcement details
2.5  Jini, Discovery protocols, Multicast Request Protocol
2.6  A sample JXTA Architecture
2.7  Election with the bully algorithm
2.8  Election with the ring algorithm
2.9  Dynamic Monitoring Framework
2.10 Architecture of PANACEA framework

3.1  System use cases
3.2  Distribution of components of the monitoring system

4.1  Typical components of Monitoring System
4.2  AgeMon System – architecture overview
4.3  AgeMon System – a sample deployment
4.4  AgeMon System – components decomposition
4.5  Façade and Event Listener design patterns
4.6  Agent Communication Layer – Transport Layer design
4.7  Different implementations of Discovery
4.8  Model-View-Controller design pattern
4.9  Components in the GUI
4.10 Observer pattern used to notify the Monitored System
4.11 Entities in the database structure for the Persistence Role
4.12 Command Line Interface – architecture

5.1  Agent Class Diagram
5.2  Feedbacks interfaces
5.3  Design of the monitoring role
5.4  GUI Role – Agent Graph
5.5  Agent component – dynamic features
5.6  Create Monitoring Wizard
5.7  'Create Monitoring' Wizard
5.8  Rule Component
5.9  Persistence Component
5.10 Manage Visualisation Dialog
5.11 Example visualisation
5.12 Architecture of the Visualisation Component
5.13 Components used by the Rule Role
5.14 Data Access Object Design Pattern
5.15 Persistence Role Components
5.16 CLI Role – screenshot with help
5.17 CLI Role – screenshot with connections
5.18 States of the Agent in the AgeMon System
5.19 Startup of the Services in the AgeMon System
5.20 Design of the Local Repository

6.1  “Get Monitoring Name” Request
6.2  “Get Monitoring Name” Response
6.3  “Start Election” Message
6.4  “Election Token” Message
6.5  “Advert Request” Message
6.6  “Advert Reply” Message
6.7  “Delete Link” Message
6.8  “Delete Link” Message
6.9  Three debug messages displayed in the GUI agent
6.10 “Truncate Database” Message
6.11 “Get Database” Description
6.12 “Database Description” Message
6.13 “Stop Agent” Request
6.14 “Get Results From Database”
6.15 Get Results From Database Response
6.16 “Query Database” Message
6.17 “Query Database” Response
6.18 “Monitoring Result” Message
6.19 Active Monitorings Advert Request
6.20 Active Monitorings Advert Response
6.21 “Advert Monitoring Started”
6.22 “Persisted Monitorings Description” Request
6.23 “Persisted Monitorings Description” Response
6.24 “Request Monitoring Sources”
6.25 “Request Monitoring Sources” Response
6.26 “Stop Monitoring” Request
6.27 Stop Monitoring Request
6.28 Transferring Data From Substitute Agent
6.29 “Deploy Rule” Message
6.30 “Delete Rule” Message
6.31 Create Link Source Request
6.32 “Create Link Source” Response
6.33 “Advert Link Created” Message
6.34 “Get Rules” Message
6.35 “Get Rules Reply” Message
6.36 “Monitoring Start” Request
6.37 “Monitoring Start” Reply
6.38 “JMX Advert” Request
6.39 “JMX Advert” Reply

7.1  Redundancy of Key Components in AgeMon System
7.2  Overview of the election algorithm
7.3  Design of election components

8.1  Measurements for Persistence Agent in Scenario 1
8.2  Measurements for Persistence Agent in Scenario 2
8.3  Measurements for Persistence Agent in Scenario 3
8.4  Performance of Persistence Agent
8.5  Scalability in the AgeMon system
8.6  Latencies between Monitoring and Rule Agents
8.7  Latencies between Monitoring and Persistence Agents
8.8  Process of the election of a substitute agent
8.9  Operations performed after Persistence Agent is Recovered
8.10 Restarting Monitoring Agent

List of Listings

5.1  Two key interfaces used in the Monitoring Role
5.2  Abstract Agent Action
5.3  Wizard Interface
5.4  Parameter Annotation
5.5  IChart interface
5.6  Two examples of the constraint definition using JavaScript
5.7  Example which presents how the script will be called from the AgeMon
5.8  Example of console reading in Java
5.9  Command and Attribute interfaces
5.10 Enumeration used to enable contextual commands
5.11 Service Interface
6.1  Transport Layer Interface
6.2  Agent and Agent Group methods used for messaging
6.3  Method used to process messages by Agent
6.4  Message Interface and Processor Interface
6.5  Severity levels used in Debug
6.6  Query used to list the tables in the database (HSQL)
6.7  Wrapper for the Monitoring Result
6.8  Severity levels used in Debug
7.1  Election interfaces
8.1  Bash script used to restart a remote agent

Chapter 1. Introduction

This chapter introduces the motivation for the thesis, followed by a list of objectives. Next, key concepts such as high availability, fault tolerance and self-healing are introduced. The contribution of the thesis to current research is then discussed. An overview of the organization of the text is given at the end of the chapter.

1.1. Motivation

Computer systems are becoming more and more complex, and this is especially true of distributed systems, whose number is growing rapidly. Cloud storage and cloud computing are good examples of this trend. A few years ago these terms were known only to scientists and to a limited number of people working in the IT industry. Today, cloud-based solutions are broadly available to end users.

Complex distributed systems can be built from components which cooperate in order to achieve common goals. Monitoring such components is especially challenging. Decomposing a system results in a need to distribute the monitoring system itself. Moreover, the complexity of applications requires the monitoring system to provide some degree of 'intelligence': it should be able to guide the user towards the best way of monitoring. The most complex monitoring systems currently available are able to work autonomously, which means that some or most operations are executed without user interaction. Based on observations of an application, such a monitoring system can decide what action should be taken, for instance whether any additional monitoring should be performed or whether user interaction is required.

Like clouds, terms such as High Availability and Fault Tolerance are becoming very popular. Not long ago, High Availability was reserved for military or banking systems. Currently, even basic hosting solutions are advertised as Highly Available. Solutions which combine High Availability and High Performance Computing (HPC) are becoming available to many more users [135]. This trend drives changes in monitoring systems. The question is: how can a Highly Available application be monitored?

Increasing the complexity of a monitoring system results in a higher probability of faults in that system. In such a situation, the system should recover from a fault; in other words, it should be able to perform self-healing. Healing can also be considered from the perspective of the application. In addition to regular monitoring, a good monitoring system can provide a way of healing the application. Based on predefined rules, the system could take an action to help the application, for instance restarting components or disconnecting a failed resource.

There is a large number of monitoring systems available on the market. Such systems can be divided into groups based on the capabilities they offer. For example, some products are best suited to monitoring Solaris based systems, while others are used to monitor Corba based applications. Most modern monitoring systems support distributed monitoring. Systems like Zabbix, Ganglia or Nagios are mature open-source systems with large communities. On the other hand, there is a large amount of commercial software. One of the most mature products, commonly used in large companies (46 of the top 50 Fortune 500 companies), is Compuware dynaTrace [19].

Unfortunately, the existing solutions do not provide all the features required by some modern applications. The existing monitoring systems:
- are sometimes used to monitor highly available applications or systems, but are not themselves highly available or fault tolerant,
- do not provide self-healing capabilities,
- are hard to deploy and complex to use,
- usually manifest problems when it comes to integration with applications (in order to heal the application).

In order to monitor a highly available application, the monitoring system should also be highly available. This minimizes the risk of losing important monitoring data gathered at system runtime. In the following sections some of the key concepts and terms are introduced and discussed.

1.2. High availability

The availability of a system determines whether the system is able to provide the required service. When a service cannot be used by the user, it is said that there is an outage in the system. Downtime is the period of time during which the system is unavailable [120]. A Highly Available (HA) system is designed to avoid loss of service by reducing failures and downtime. System availability can be measured and is usually expressed as the percentage of time during which the system is available in a given year. Based on this, systems can be grouped into multiple classes [122]. These classes are presented in Table 1.1.
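As a rough illustration of how an availability class translates into a downtime budget, the following Python sketch computes the maximum downtime per year and per day for each class. It assumes that class n corresponds to n nines of availability and that a year has 365 days; the function names are hypothetical and are not part of the thesis or of the AgeMon system.

```python
# Illustrative sketch only: downtime budget for "n nines" availability classes.
# Assumptions (not from the thesis): a 365-day year; all names are hypothetical.

SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365


def availability_for_class(nines: int) -> float:
    """Return the availability fraction for a class, e.g. 3 -> 0.999."""
    return 1.0 - 10.0 ** (-nines)


def max_downtime_seconds(nines: int, period_days: int = DAYS_PER_YEAR) -> float:
    """Return the longest total outage (in seconds) allowed within the period."""
    unavailable_fraction = 10.0 ** (-nines)
    return unavailable_fraction * period_days * SECONDS_PER_DAY


if __name__ == "__main__":
    for nines in range(1, 8):
        per_year_min = max_downtime_seconds(nines) / 60
        per_day_s = max_downtime_seconds(nines, period_days=1)
        print(
            f"class {nines}: {availability_for_class(nines) * 100:.5f}% "
            f"-> {per_year_min:.2f} min/year, {per_day_s:.3f} s/day"
        )
```

Under these assumptions a class 5 (five-nines) system may be unavailable for a little over five minutes per year, which is less than one second per day, so manual recovery is not a realistic option.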

Availability    Class
90%             1
99%             2
99.9%           3
99.99%          4
99.999%         5
99.9999%        6
99.99999%       7

Table 1.1: Classes in High Availability Systems

A system which provides 99.999% availability is considered a high-availability system (the term five-nines is also used). The downtime of such a system should not exceed about 5.3 minutes per year. 'Six-nines' and 'seven-nines' systems are known as very highly available or ultra-available systems.

High availability implies a service level agreement in which both planned and unplanned computer outages do not exceed a small stated value. HA systems should require no human interaction in order to recover. It is not feasible for a user to resolve an issue in a five-nines system when the maximum downtime is less than one second per day. HA concepts should be taken into consideration from the very first step of designing the system architecture. Such systems should combine software techniques and industry-standard hardware to minimize downtime by quickly restoring essential services. HA does not assume that the system will operate without interruption – the key is to reduce the duration of a failure and to provide a rapid way of recovery [121]. High Availability systems are reactive – the emphasis is on failover and recovery [123]. In addition, the class of continuously available systems groups applications with a proactive approach: such systems try to detect and prevent errors in advance.

Fault tolerance

Fault Tolerance is usually more focused on the hardware side of the system. According to one definition [121], a fault tolerant environment has no service interruption at all. This is usually realized by specialized hardware which detects faults and instantaneously switches to redundant components – processors, power supplies or storage systems. Unfortunately, this usually means significantly higher costs, since the total number of components is at least doubled.
Fault Tolerance can also be considered a way of achieving HA [120]. In some papers [122], the term Fault Tolerant refers to HA systems which provide 99.99% availability. Sometimes FT is used in the same context as HA and both terms are considered interchangeable. Certainly, architectural concepts like redundancy or replication can be used to build both Fault Tolerant and High Availability systems.

1.3. Self-healing

Self-healing is the ability of a system to recover from a failure state. Additionally, a self-healing system should be able to perceive that its own operation is not correct [105]. A healing action can be performed autonomously or may require user intervention (assisted-healing systems). Self-healing can be discussed as a separate entity, but sometimes it is considered a subclass of fault tolerant systems [106]. Self-healing brings considerable added value to fault tolerant systems – it can significantly improve the availability of the system.

In other papers [107] self-healing is described in the context of self-managing autonomic systems. In such systems the human operator takes on a new role. He or she does not control the system directly. Instead, the user defines general rules and policies that guide the self-management process. IBM defines the following areas of autonomicity [108]:
- self-healing – automatic fault detection and recovery,
- self-configuration – dynamic system configuration,
- self-protection – automatic detection of and protection against attacks,
- self-optimization – automatic reconfiguration for better performance.

The key question where self-healing is concerned is whether the healing can be done automatically (self) or requires user interaction. In this thesis, self-healing is understood as in the most common definition. Similarly to the IBM concepts, self-healing is an action which should not involve any manual user interaction during failure detection and recovery. Therefore, a system with self-healing functionality should also be autonomous (the self-healing components of the system should be autonomous).
The autonomy of the system does not exclude user interaction, for instance during system setup. While failure detection and recovery should be completely automatic, human interaction may be needed to define high level rules for decision making, templates, data sources, properties, etc. [109]. Fig. 1.1 presents a state diagram of a self-healing system [105].

[Figure 1.1: State diagram of the self-healing system. States: Normal State, Degraded State, Broken State. Transitions: Maintenance of health, Failure Detection, System Recovery.]
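The state diagram can be sketched as a small transition table. This is only an illustrative model: the state and transition names follow Fig. 1.1, but the exact arrows of the figure cannot be recovered from the text, so the transition table below (a detected failure escalates Normal to Degraded to Broken, and recovery always returns to Normal) is an assumption, as are all identifiers.

```python
from enum import Enum


class State(Enum):
    NORMAL = "normal"
    DEGRADED = "degraded"
    BROKEN = "broken"


class Event(Enum):
    MAINTENANCE_OF_HEALTH = "maintenance of health"
    FAILURE_DETECTION = "failure detection"
    SYSTEM_RECOVERY = "system recovery"


# Assumed transitions; the source figure may connect the states differently.
TRANSITIONS = {
    (State.NORMAL, Event.MAINTENANCE_OF_HEALTH): State.NORMAL,
    (State.NORMAL, Event.FAILURE_DETECTION): State.DEGRADED,
    (State.DEGRADED, Event.FAILURE_DETECTION): State.BROKEN,
    (State.DEGRADED, Event.SYSTEM_RECOVERY): State.NORMAL,
    (State.BROKEN, Event.SYSTEM_RECOVERY): State.NORMAL,
}


def step(state: State, event: Event) -> State:
    """Return the next state; events that do not apply leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(events, start: State = State.NORMAL) -> State:
    """Apply a sequence of events and return the final state."""
    state = start
    for event in events:
        state = step(state, event)
    return state


if __name__ == "__main__":
    history = [Event.MAINTENANCE_OF_HEALTH, Event.FAILURE_DETECTION,
               Event.FAILURE_DETECTION, Event.SYSTEM_RECOVERY]
    print(run(history).value)
```

Keeping the transitions in a dictionary makes it easy to swap in a different set of arrows once the intended diagram is known.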

There are three states of the system from the self-healing perspective. The most desirable one is the normal, healthy state. A system in this state works correctly and should fulfil all the requirements. Nevertheless, this does not mean that a system in this state performs no self-healing actions. For instance, the system should periodically check its own state. This can be done by analysing application logs. Other types of logs, e.g., performance logs, can be used to verify the responsiveness of an application. Additionally, the system needs to manage the state of its redundant components and maintain diversity – these tasks can result in failure detection, but their main intention is to keep the system healthy.

In addition to the tasks used to maintain a healthy system, there are tasks used to detect a failure in the system. There are multiple strategies here:
- detection of missing components – the system should be aware of all its components and should be able to determine that a component is not present. This approach is also described as the missing message [110]. Detecting missing components is sometimes not a trivial task [111, 112].
- system monitoring – different monitoring models can be applied here. For instance, monitoring data can trigger an action defined by the user [113]. Monitoring can also use historic data and trigger only when a discrepancy is found between the current state and the historic data. The system can proactively probe for data. Monitoring can also detect missing or duplicated components. Additionally, other techniques have been developed to identify foreign elements in the system [114, 115].

A failure itself is a manifestation of an error caused by a system fault [132]. There are a number of classifications of errors. For instance, it is possible to classify software faults based on the circumstances needed to trigger an error [133, 134].
The first type of faults is named “Bohrbugs” – the term refers to a group of faults which can be easily reproduced. Such errors can be caught during the testing phase of the software release cycle. The second group of problems, “Heisenbugs”, is much more difficult to catch. This term groups faults which are hard to reproduce. The same fault may not occur after the application is restarted. Such issues are mostly caused by non-deterministic behaviour of the system. In such systems it is not possible to predict all possible system states. Race conditions in multi-threaded environments are among the most common examples of such issues. The third fault category consists of bugs caused by transient hardware issues. Modern hardware (e.g. the CPU) is more error-prone because, for example, microprocessors are much more densely packed. Fortunately, most such failures are detected and fixed by low level control programs and are transparent to the application. The last group of faults is related to system ageing. Systems which run for a long time tend to become slower (e.g. due to resource leaks). Figure 1.2.a summarizes the list of faults.

A failure can manifest itself in the system in different ways. A failure model describes each type of failure and how the system reacts. Figure 1.2.b presents different failure models [134]. There are four types of behaviour:
- crash failure – the process is stopped and it does not recover,
- omission failure – the process is not stopped, but sometimes it does not respond to inputs,

- timing failure – the process is not stopped, but responses are sent too early or too late,
- byzantine failure – the process can behave totally arbitrarily.

[Figure 1.2: Software faults (a) and failure model (b). Recoverable labels: (a) Heisenbug, Aging; (b) Task stops, Task does not always respond, Task responses do not meet SLA, Totally arbitrary behavior (Byzantine).]

When a failure in the system is detected, the system is recognized as broken. Self-healing systems should be able to recover from this state, to heal themselves. One of the major concepts used here is the redundancy of components. With this approach, when a component fails, other components can take over the responsibilities of the failed component. This may decrease the performance of the system, but the system will continue to operate. In loosely coupled environments, an additional problem is to detect a malicious fault, known also as the Byzantine Generals Problem. In order to solve such problems, a voting procedure should be implemented. Other recovery solutions are typically very specific to a particular system. Multiple attempts have been made to structure these problems. For instance, in one of the approaches healing is done through cooperation between components. This type of recovery is becoming very popular [116, 117].

In addition to the normal and the broken state, a degraded state can be introduced. A system in this state can be considered working, but some of its functions may be limited. For instance, a performance indicator could be significantly degraded. If no action is performed immediately, the degraded state can quickly turn into the broken state. As another example, consider an application which uses 99% of memory. It is very probable that the system will fail soon if no action is taken to free the memory.

In this thesis the self-healing approach is the fundamental concept used to achieve high availability. It is described in detail in Chapter 7.

1.4. Agent based systems

Throughout this thesis we make use of the concept of agent-based systems. The reason for selecting this approach is discussed elsewhere in the thesis, while this section presents a short overview of agent-based systems.

The term agent has multiple definitions. In general, an agent is a program or service that acts on behalf of the user or another program. The agent should work for some time without user interaction, therefore it should contain the logic to decide which particular action is appropriate [74]. In the literature one can encounter multiple definitions which vary in details. In some of them [76] the agent should perform autonomous actions – decide itself what

it needs to do in order to satisfy the design objectives. In addition, the agent should be capable of interacting with other agents, mimicking social activities like cooperation, negotiation and the like. This definition can be elaborated even further. In one of the definitions, an agent represents a social component of a large system in order to explore emergent global behaviour in a simulation [77]. On the other hand, an agent can also be understood as a software agent, similar to a service/daemon. With this definition, the agent has no intelligence, or only a very limited amount of it [75].

Agents can be classified in two ways. The first classification is based on capabilities: an agent can be “intelligent” (if it implements some aspects of artificial intelligence like reasoning or learning) and/or “autonomous” (if it can modify the way in which it achieves its goals). The second classification is based on the deployment type. The term distributed agent systems describes a deployment in which the agents are executed on physically different machines. If, in addition, agents can relocate between different machines, we speak about mobile agent systems. Multi-agent systems are used when a single agent cannot achieve the goal. In order to successfully solve a problem, agents need to cooperate and exchange messages.

There have been several attempts to define a practical model of multi-agent systems. One example architecture is called M-Agent Architecture. It introduces a general description of a system without enforcing a programming language or the formalisms used to describe different parts of the system [139, 140].

In this thesis we focus on agents that are not fully autonomous. The system is based on agents which perform actions predefined by the user and do not exhibit automatic social behaviour. More autonomous agents should be considered in future versions of the proposed system/architecture.

1.5. Reliability

In general, software reliability is a probabilistic measure which can be defined as the probability that software faults do not cause a failure during a specified time [150]. From a mathematical point of view, it can be defined as the following function [150]:

$$R(t) = \Pr\{T > t\} = \int_{t}^{\infty} f(x)\,dx$$

where
R(t) = reliability,
T = working time without failure,
t = required (assumed/specified) working time without failure,
f(x) = failure probability density function.

There are different approaches to measuring software reliability. For instance, one can assume that reliability has only two values, 0 or 1, meaning that the software either is or is not reliable. Of course, this metric could be hard to use, especially for complex systems. More complex models used to measure the reliability of software are listed below.

- Times Between Failures Models – models from this class focus on the time between failures. Some approaches from this class assume that the distribution of faults depends on the total fault count in the program. They assume that as the software ages, the times between failures get longer because bugs are fixed. Estimated parameters are obtained from observation of the program. Example models are: the Jelinski and Moranda De-Eutrophication Model [141], the Schick and Wolverton Model [142], the Littlewood-Verrall Bayesian Model [143].
- Failure Counts Models – models from this class are interested in the number of failures in specified time intervals. Values for the models are provided by observations or are the results of transforming results from other classes. Example models are: the Goel-Okumoto Nonhomogeneous Poisson Process Model [144], the Musa Execution Time Model [145], the Shooman Exponential Model [146].
- Fault Seeding Models – the approach used in these models is to “seed” a known number of faults in the program. During the testing phase some of the seeded faults are discovered together with indigenous faults. With the help of combinatorics it is possible to calculate the total number of indigenous faults and the reliability of the application. The most popular model is the Mills Seeding Model [147].
- Input Domain Based Models – the basic approach is to generate a set of test cases from an input distribution. Example models are: the Nelson Model [148], the Ramamoorthy and Bastani Model [149].

Most of the above models are very helpful during software development. On the other hand, some of them are quite complex to evaluate. From the industry perspective, there are two major metrics used to evaluate the reliability of software:
- Mean Time Between Failures (MTBF) – the elapsed time between failures in the system. This assumes that the system can be recovered (manually or automatically) from the failure (or that the failure does not affect the overall functionality of the system).
  $$\mathrm{MTBF} = \frac{\sum (\text{beginning of downtime} - \text{beginning of uptime})}{\text{number of failures}}$$
- Failure Rate – the frequency of failures in the system. It is related to MTBF as follows:
  $$\lambda = \frac{1}{\mathrm{MTBF}}$$

Measuring reliability is especially useful during the software development process. It can be used as a metric describing the current state of the software and showing whether there are improvements across the different phases of testing (unit testing, acceptance testing, integration testing, soak and stress testing).

While the reliability of a system is frequently discussed together with high availability, there are significant differences between them. It is possible that a highly available system is considered very poor in terms of reliability (the opposite is also possible). Consider the following example. The system under discussion is intended to register stock market transactions. Due to a bug in the system, it is not possible to process transactions during a one second interval at every other midnight. The availability of the system can be calculated as:

$$A = \frac{t_{up}}{t_{total}} \approx 0.999994$$

that is, one second of downtime in every 172 800 seconds (two days). Based on the above calculation, the system can be considered highly available. On the other hand, the MTBF for this system is 48 hours – one second of downtime every other day. This implies that the system cannot be considered reliable (especially for registering stock operations).

1.6. Keywords and Definitions

The topic of this thesis is focused on the monitoring system and the monitored system. It is very important to have a clear distinction between the two. Therefore, the monitored system (an application, an operating system) is called System Under Monitoring – SUM. The system which performs the monitoring is called Monitoring System – MS. HA is an acronym which stands for High Availability, while FT means Fault Tolerance. In this thesis high availability of the Monitoring System is achieved if its availability rate reaches 99.999%. The reliability of the monitoring system is measured with the MTBF metric.

1.7. Objectives

The main objective is to verify the following thesis:

Self-healing techniques enable building of highly available and reliable monitoring systems needed for developing and maintaining highly available applications.

In order to examine this thesis, a new monitoring system will be developed. It will be used to evaluate the proposed solutions.
It should provide:
- self-healing capabilities which help to implement the high availability requirements; the system cannot lose any monitoring data gathered during a monitoring session,
- a distributed, loosely-coupled architecture – the design of the system should be based on a distributed architecture and the system should be able to monitor distributed applications,
- autonomicity – it should be possible to define and deploy rules and actions which will be executed automatically by the monitoring system,
- capability of integration with an application – the model should allow for integrating the monitoring system with an application in order to heal it, or to provide monitoring data which can be used by the application in its regular functioning.

The new monitoring system is called AgeMon. This name is used in the remainder of this thesis. A more detailed discussion of the requirements is provided in Chapter 3.

1.8. Research Contribution

The main contributions of this thesis are as follows:
- analysing different aspects of High Availability and Self-Healing,
- evaluating whether Self-Healing concepts can be used to provide High Availability solutions,
- selecting the best Self-Healing techniques which can be used in monitoring systems,
- creating a reusable and generic model of a Self-Healing monitoring system,
- creating a prototype of a monitoring system which provides self-healing components. It should be possible to reuse these components in other monitoring systems.

1.9. Organization of Thesis

In this Introduction we presented the motivation for a new, self-healing and highly available monitoring system. The main reason for implementing the new system is that the existing monitoring systems do not provide such functionality. Therefore there is no good way of monitoring Highly Available applications. In the chapters that follow, a detailed list of requirements is presented, as well as the design and architecture of the new system.

The thesis is organized in 9 chapters. Chapter 2 presents a background of technologies related to the thesis. It is followed by a definition and specification of the system requirements and use cases in Chapter 3, covering both functional and non-functional requirements. In Chapter 4, the system design and architecture are discussed. Chapter 5 presents the implementation aspects of the new system – it describes the tools and techniques used to develop the system and gives a detailed discussion of its major components. Chapter 6 describes the issues of communication between the system's components. It starts with a description of the communication stack, while the rest of the chapter presents the different messages exchanged in the system. Chapter 7 focuses on the High Availability and Self-Healing requirements. It is followed by Chapter 8, where the tests of the system are presented thoroughly and discussed. In Chapter 9 we summarize the thesis and discuss the features which can be implemented in the future.

Chapter 2. Technology background

This chapter describes techniques and technologies which could be used to implement a Highly Available monitoring system. The chapter begins with discovery algorithms, continues with election algorithms and ends with a description of selected monitoring systems.

2.1. Discovery and Communication

Discovery provides automatic detection of services in a computer network. The architecture of the designed monitoring system is distributed; therefore services can be installed on multiple physical machines. One of the requirements states that the system should be easy to use and need only minimal configuration. At the same time, the installation procedure should be as simple as possible. In order to meet those requirements, automatic discovery protocols need to be used. Thanks to them, the user does not need to specify the location of each system component. Whenever possible, services/components should be able to discover each other.

The communication between components should be reliable – monitoring messages should not be dropped. At the same time, the communication should be fast enough to handle the large amount of monitoring data. The first requirement is realised by, e.g., the TCP protocol, while the second one is provided by UDP.

This section presents an overview of discovery libraries. It starts with protocols which provide only basic discovery functionality, like SLP or UPnP. Then the more complex libraries are presented (JGroups, JXTA). In addition to

the discovery, they also provide a messaging infrastructure which can be used as the communication layer in the AgeMon system.

2.1.1. Service Location Protocol – SLP

The Service Location Protocol (SLP) [7, 8] is used for discovering and selecting network services. It can be considered a high level protocol for services and applications – it is not used for identifying and discovering computers, hosts or network adapters. The main intention is to provide a scalable solution for describing the capabilities of network services and discovering those services in the network without worrying about low level protocol implementations.

The main advantage of using SLP is that it frees the user from having to know the host name and the service port/service name. The protocol provides a way to describe what functionality a service provides, so the user can search through the existing services to find the one that is closest to their needs. The description is written in a property.name=value style and the specification of a query follows the same rules – the user can use an LDAPv3 [6] filter.

In SLP we can distinguish three main entities:
- User Agent (UA) – the process working on the user side, querying and establishing connections with the services. The user agent is responsible for communicating with the service agents and directory agents.
- Service Agent (SA) – the process working on behalf of the service, responsible for advertising the service capabilities across the network.
- Directory Agent (DA) – the process collecting the advertisements gathered from the service agents. SLP does not require a DA – the main intention of this entity is to enable protocol scalability in larger environments.

Fig. 2.1 depicts the cooperation between the parts of SLP.

Figure 2.1: Interactions between SLP entities.
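The roles of the three entities can be illustrated with a toy in-memory model. This is a sketch under strong assumptions and not an SLP implementation: no network traffic or SLP message format is involved, an exact name=value comparison stands in for a real LDAPv3 filter, and all class names, method names and service URLs are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Advertisement:
    """What a Service Agent announces: a service type, a URL and attributes."""
    service_type: str                      # illustrative, e.g. "service:printer"
    url: str                               # illustrative service URL
    attributes: Dict[str, str] = field(default_factory=dict)

    def matches(self, service_type: str, required: Dict[str, str]) -> bool:
        # Simplified stand-in for an LDAPv3 filter: every required
        # name=value pair must be present with exactly that value.
        return self.service_type == service_type and all(
            self.attributes.get(name) == value
            for name, value in required.items())


class ServiceAgent:
    """Acts on behalf of one service and answers matching requests."""

    def __init__(self, advert: Advertisement) -> None:
        self.advert = advert

    def answer(self, service_type: str, required: Dict[str, str]) -> List[Advertisement]:
        return [self.advert] if self.advert.matches(service_type, required) else []


class DirectoryAgent:
    """Optional cache that collects advertisements from Service Agents."""

    def __init__(self) -> None:
        self._cache: List[Advertisement] = []

    def register(self, advert: Advertisement) -> None:
        self._cache.append(advert)

    def answer(self, service_type: str, required: Dict[str, str]) -> List[Advertisement]:
        return [a for a in self._cache if a.matches(service_type, required)]


class UserAgent:
    """Queries the Directory Agent if one is known, otherwise every Service Agent."""

    def find(self, service_type: str, required: Dict[str, str],
             service_agents: List[ServiceAgent],
             directory_agent: Optional[DirectoryAgent] = None) -> List[str]:
        if directory_agent is not None:
            found = directory_agent.answer(service_type, required)
        else:
            found = [a for sa in service_agents
                     for a in sa.answer(service_type, required)]
        return [a.url for a in found]


if __name__ == "__main__":
    colour = ServiceAgent(Advertisement(
        "service:printer", "service:printer://host-a", {"color": "true"}))
    mono = ServiceAgent(Advertisement(
        "service:printer", "service:printer://host-b", {"color": "false"}))
    print(UserAgent().find("service:printer", {"color": "true"}, [colour, mono]))
```

In this model the Directory Agent changes only where the User Agent sends its query, which mirrors its role as an optional component added for scalability.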
Protocol overview

The protocol handles multiple scenarios of discovering and querying service capabilities. In the simplest case there are only User Agent(s) and Service Agent(s) in the network. A UA sends a multicast service request, specifying the needed service, directly to the SAs. All services that meet the criteria send back a reply. This case is shown in Fig. 2.2.a.

The second approach is to use Directory Agent(s). The main purpose of a DA is to provide cache functionality for the SLP network. If there is a DA in the network, UA

and SA are obliged to use it in the discovery process. The message exchange between UA/SA and DA is presented in Fig. 2.2.b. To obtain the Directory Agent's address, a UA or SA can send a multicast request to the network. The Directory Agent should reply to this request with a unicast message. The DA should also periodically send an advertisement message to the multicast address – User Agents and Service Agents should listen for such messages (Fig. 2.2.c).

Figure 2.2: SLP Protocol use cases: a) discovery process without a Directory Agent, b) discovery process with the use of a Directory Agent, c) discovery of the Directory Agent.

When a larger network is considered, it is also possible to group agents. The grouping is done by using 'scopes' – a scope is just a regular string that describes a group, like groupID1. Service Agents and Directory Agents are always attached to a group, while for a UA this is not required. If a UA is not bound to any scope, it queries all the Service and Directory Agents within the network.

The reserved listening port for SLP is 427. This is the destination port for all SLP messages. This port is also used for listening for broadcast messages. The main disadvantage is that on some operating systems (like Linux) root privileges must be granted to the user who runs SLP. It is possible to change the default port, but then integration with other SLP subsystems could be problematic.

There is also a separate RFC [9] which defines the APIs that should be provided by a concrete implementation of the protocol. It also defines a standard set of properties used for SLP configuration.

Security

The Service Location Protocol supports digital signatures of messages. This means that the service URL and the attributes can be checked for modification. The signature can be generated by an external system configured by the administrator. Each signed message also contains an SLP Security Parameter Index (SPI) which identifies the algorithm used to generate the signature. All types of agents are obliged to support at least the DSA [10] algorithm. Data encryption is not supported by SLP – a lower level solution, such as IP Encapsulating Security Payload, should be used.

Protocol implementations

There are two main implementations of the SLP protocol. This subsection briefly describes each of them.

OpenSLP

The best known implementation of the Service Location Protocol is OpenSLP¹. It is an open source implementation and a continuation of the project started by Caldera Systems. For Linux based environments it provides an SLP daemon (slpd) that offers the functionality of the Service and Directory Agents. The implementation is written in C. Prebuilt binary libraries (RPMs) are available for Linux. User Agents should use the libslp.so library – slpd is not needed for this kind of agent. A developer API is provided for C (as library functions) and for Java. The latest stable release is 1.2.1 (3 Mar 2005). A new version is under development – there is a beta release, OpenSLP 2.0.0 beta 1. Besides the sources, this release also contains daemon binaries for Microsoft Windows.

OpenSLP is probably the first and oldest implementation of the SLP protocol. It supports almost all of the SLP specification. The main disadvantage of this implementation is that it does not fit dynamic environments well. The libraries have to be preinstalled in the system in order to use SLP (a daemon installation is required to use the SA or DA functionality). Using less common functions, like security, requires a manual rebuild from the sources.

jSLP

jSLP² is a pure Java implementation of the Service Location Protocol. This implementation can run on every Java 2 VM, so it also supports the J2ME CDC profile. The main advantage of this implementation is that it does not depend on the operating system, so it can run on Linux, Windows or Solaris.
jSLP implements all requirements defined in RFC 2608 [7] for the SA and UA, as well as most of the optional features like Authentication Blocks (message signatures) and User Defined Scopes. There are some small differences compared to the RFCs, such as LDAP filters or the types of the parameters used in the API. By default, all the required classes are bundled in a single jar in the binary distribution. It is also possible to manually select which functionality will be used (SA or UA) and build only the required features to reduce the footprint.

jSLP automatically runs the SLPDaemon, which listens for SLP messages. It is possible to run more than one SLP service (SA) on the same machine in different VMs. In such cases jSLP automatically checks whether an SLPDaemon is already running, so a new daemon is not needed. This daemon also contains a service registry, used when there is no Directory Agent in the selected scope in the network.

The main advantage of this implementation is that it is written in Java, which makes it truly cross-platform. The second advantage is the small size of the binaries (80 kB for UA, SA and daemon). The documentation available on the web page is also worth mentioning.

¹ OpenSLP home page (old) – http://www.openslp.org/, home page (new) – http://sourceforge.net/projects/openslp/
² jSLP home page – http://jslp.sourceforge.net

The biggest disadvantage of this implementation is its limited flexibility. Because of its small size, some functionality is simplified. For example, it is not possible to register a listener to be notified about agent group changes. The user is able to ask for a selected agent, but it is not possible to check whether a new agent has been attached to the group. It is also not possible to migrate the directory service after a failure.

Summary

The Service Location Protocol is a good solution for automatically discovering services in the network. It is used in many different areas, e.g., to discover network printers, and as a discovery protocol in WBEM³ [11]. It is simple to use and flexible. There are also stable implementations of the protocol for both Java and C/C++.

The main disadvantage of this protocol is that it is designed to be used only in local networks (a multicast domain). Additional infrastructure has to be developed to support larger networks. The second problem is the implementations. On the one hand there is the well-known, impressive OpenSLP implementation, but it requires binary libraries to be preinstalled in the operating system. On the other hand there is a Java implementation which is cross-platform, but with which some operations are very hard (or impossible) to achieve.

2.1.2. Simple Service Discovery Protocol (SSDP)

The Simple Service Discovery Protocol [12] is designed to enable automatic discovery of HTTP clients and HTTP resources. The main intention of the protocol is to resolve the situation in which an HTTP client requests a service that can be provided by one or more HTTP resources – it helps to find the location of the service which the client desires. In this protocol we can identify two main entities: the SSDP client (an HTTP client which asks for resources) and the SSDP service (an HTTP resource which provides the requested service).
The protocol was proposed by Microsoft and Hewlett-Packard. The problem domain is best illustrated by the following scenario: a user sets up a small network (at home or in an office). In this scenario SSDP should start without any configuration and help the user discover the available services (such as printers, scanners, faxes or routers). SSDP is used as the discovery protocol in Universal Plug and Play (UPnP)4, which enables automatic configuration of network components. The protocol was designed as a P2P protocol and brings easy-to-use, flexible, standards-based connectivity to ad-hoc or unmanaged networks [13]. SSDP provides the automatic discovery of network devices.

Protocol overview

In IPv4 networks SSDP clients use the multicast address 239.255.255.250 with port number 1900 (the SSDP channel and the SSDP port). This address is used for listening for UDP messages/requests from other participants.

3 WBEM Home Page – http://www.dmtf.org/standards/wbem
4 Universal Plug-and-Play Forum – http://www.upnp.org/

In SSDP the services are identified by a unique pair of the following values:

- Service type – identifies the type of the service, such as a clock or a router, using URI notation, e.g., upnp:clockradio, ms:router.
- Unique Service Name (USN) – uniquely identifies an instance of the service. There may be more than one service of the same type in the network, but the service name should always be unique across the network. The name is also given in URI notation.

In addition to the above pair, the service also provides expiration information (e.g., 1h) and a location URI (e.g., http://location of the service).

Two types of messages are exchanged over the SSDP multicast channel during the discovery procedure:

- Discovery request – an HTTP-over-UDP message with the specification of the requested service. If a service matches the specification, it should respond with a unicast HTTP-over-UDP message. The request should always contain the type of the requested service (ST header) and the mandatory field set to “ssdp:discover” (Man header) [14]. A successful response must contain the Unique Service Name (USN header) and the service type. The response should also contain the location information and the expiration information. An example of a discovery request and its response is presented in Fig. 2.3.a and Fig. 2.3.b.
- Presence announcement – information sent by a service to the other participants to indicate that it is alive. It also contains the current value of the expiration parameter. If a client caches this information, the new value from the message should override the cached value. The announcement is sent periodically. An example message is presented in Fig. 2.4.

Figure 2.3: SSDP Protocol – discovery request details: a) multicast discovery request used to retrieve information about the agh:clock service, b) unicast response with a specification of the agh:clock service.
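The message exchange described above can be illustrated with a short sketch. The Python code below is only an illustration and is not part of any of the SDKs discussed in this section. It builds a discovery request for the agh:clock service used in Fig. 2.3, parses the headers of a response, and keeps a small cache in which a newer expiration value overrides the cached one. The helper names (build_discovery_request, parse_message, max_age_seconds, ServiceCache) are hypothetical. The M-SEARCH start line and the HOST, MX, LOCATION and CACHE-CONTROL headers follow the UPnP flavour of SSDP and are assumptions with respect to the text above, which names only the ST, Man and USN headers. The USN and location values are made up for the example.

```python
SSDP_ADDR = "239.255.255.250"  # SSDP multicast channel (from the text)
SSDP_PORT = 1900               # SSDP port (from the text)


def build_discovery_request(service_type, max_wait_s=3):
    """Build a multicast discovery request for one service type.

    Assumption: the UPnP-style M-SEARCH start line and the HOST/MX headers.
    """
    lines = [
        "M-SEARCH * HTTP/1.1",
        f"HOST: {SSDP_ADDR}:{SSDP_PORT}",
        'MAN: "ssdp:discover"',   # mandatory field named in the text
        f"MX: {max_wait_s}",      # assumed: maximum response delay in seconds
        f"ST: {service_type}",    # type of the requested service
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_message(raw):
    """Split an HTTP-over-UDP message into its start line and a header dict."""
    head = raw.decode("ascii", errors="replace").split("\r\n\r\n", 1)[0]
    start_line, *header_lines = head.split("\r\n")
    headers = {}
    for line in header_lines:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().upper()] = value.strip()
    return start_line, headers


def max_age_seconds(headers):
    """Read the expiration value, assumed to be sent as CACHE-CONTROL: max-age=N."""
    for directive in headers.get("CACHE-CONTROL", "").split(","):
        key, sep, value = directive.partition("=")
        if sep and key.strip().lower() == "max-age":
            return int(value.strip())
    return None


class ServiceCache:
    """Client-side cache keyed by USN; a newer expiration overrides the old one."""

    def __init__(self):
        self._entries = {}  # USN -> (location, expiry timestamp)

    def update(self, headers, now):
        """Store or refresh one service; return False if the message is unusable."""
        usn = headers.get("USN")
        age = max_age_seconds(headers)
        if usn is None or age is None:
            return False
        self._entries[usn] = (headers.get("LOCATION"), now + age)
        return True

    def alive(self, now):
        """Return the USNs whose expiration time has not passed yet."""
        return sorted(u for u, (_, exp) in self._entries.items() if exp > now)


if __name__ == "__main__":
    print(build_discovery_request("agh:clock").decode("ascii"))
```

The same update method can be fed with the headers of a discovery response or of a periodic presence announcement, because both carry the USN and the expiration information. Sending the request would additionally require a UDP socket bound to the multicast channel, which is omitted here.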

Figure 2.4: SSDP Protocol – presence announcement details.

Implementations

There is no separate implementation of SSDP; it is always bundled with the UPnP implementations. Most of the SDKs are listed on the UPnP Forum page5. An SDK provides the infrastructure (building blocks) that a developer needs to create UPnP-based services and devices.

- Microsoft UPnP SDK
Microsoft provides a very good, detailed web page related to UPnP6. It contains a lot of documentation about the usage of UPnP in MS Windows systems, as well as samples showing how to use the API. Starting with Windows ME, Microsoft supports UPnP at the system level (however, on Windows XP the user needs to install additional software from the installation package [15]). In those systems Microsoft provides an API for developers that allows for easy creation of UPnP entities7. The APIs are grouped in two collections:
  - Control Point API – a set of COM interfaces used to find and control devices; the equivalent of the SSDP client from the specification.
  - Device Host API – a set of COM interfaces used to create a service/device interface.
The API is accessible from C++ and Microsoft Visual Basic programs, and it is as easy to use as the regular Windows API. This advantage is also its biggest drawback: it can be used only on Windows. It is also not easy to use from Java, since native code is needed.

- Cyberlink Development Package for UPnP Devices
Cyberlink develops its own SDKs for the UPnP protocol8. The SDK is prepared for the Java language and is written in pure Java, so it is cross-platform. The SDK provides a simple way to define a device and expose it in the network: it is enough to create an XML device description and write a few lines of Java [16].
The SSDP notification and discovery are done behind the scenes, hidden by the Java interfaces. This is very convenient if we want to take advantage of all the features provided by UPnP, but reusing only the SSDP functionality is not easy with this implementation.

5 http://www.upnp.org/resources/sdks.asp
6 Microsoft UPnP – http://www.microsoft.com/whdc/device/media/upnp.mspx
7 UPnP APIs – http://msdn.microsoft.com/en-us/library/aa382303.aspx
8 http://cgupnpjava.sourceforge.net/, http://sourceforge.net/projects/cgupnpjava/

- UPNPLib
UPNPLib9 is an open source implementation (The Apache Software Licence) of the UPnP protocol. Like the previous SDK, it is written in pure Java. It is not a full implementation of the UPnP protocol: it implements only the discovery process and the interactions with devices. Creating a new device is not possible with this library. A very nice feature of this SDK is the integration between JMX and the UPnP protocol: devices are exposed through a JMX interface.

Summary

SSDP is a good protocol for discovering information about services (especially physical resources). The protocol is textual and lightweight; the service part can run wherever a simple web server could be started. It is specially designed for discovering devices in the local network. The biggest problem is scalability: the protocol can be used only within the scope of a multicast network. There is no extra repository with cached service locations as in SLP (Section 2.1.1). In large networks (with a large number of services) this could lead to network flooding. Microsoft was aware of these problems and extended the specification with a new way of sending search messages [17]: a search message can be sent to the multicast SSDP address as well as directly to a unicast TCP/IP gateway, which makes the protocol more scalable.

2.1.3. JINI

Jini technology is a Java-based approach to service-oriented architecture. It can be used to build adaptive network systems consisting of federations of well-behaved network services and clients10. The term Jini is used for both the specification and the implementation. Jini technology is really impressive and covers many different aspects of distributed computing; nevertheless, only the discovery protocols used in Jini are described here.
9 http://www.sbbi.net/site/upnp/index.html
10 JINI Main Page – http://www.jini.org/wiki/Main Page

Discovery protocol overview

In the Jini nomenclature, every entity participating in a distributed system (a service or a device) is called a djinn. The first task of a djinn in a Jini system is to discover a Lookup Service. The approach here is slightly different from that of the previously described systems: the discovery protocols are not used to discover the requested services, but to discover the Lookup Service, which is then used as a registry to obtain the requested service. There may be more than one lookup service in the network. Three different discovery protocols are used to obtain a Lookup Service [18]:

- Multicast Request Protocol – used to locate a nearby lookup service (mainly when a service is starting up). The full process is presented in Figure 2.5. The discovering entity (client) creates a Multicast Response Server for handling responses from lookup servers, while the lookup server creates a Multicast Request Server for handling requests from clients. There may be more than one lookup server in the network scope, so the client should be able to handle multiple responses arriving through the Multicast Response Server correctly.
- Multicast Announcement Protocol – used by lookup services to advertise their existence. The protocol is used when a new lookup service is started, or after a network failure, so that the reference to the lookup service can be obtained again. The announcement is sent periodically to a multicast address.
- Unicast Discovery Protocol – allows an entity to communicate directly with a specific Lookup Service of interest. This is useful in large networks (larger than the range of a multicast network).

Figure 2.5: Jini, Discovery protocols, Multicast Request Protocol. [Diagram labels: discovering entity, multicast response server, lookup server, multicast request server; steps 1: creates, 2: creates, 3: lookup request, 4: lookup reference; legend: unicast message, multicast message.]

Jini supports grouping. Each group should have a unique name (a string; using a DNS-style name is recommended). Entities using the Multicast Request Protocol can specify the set of groups they want to communicate with, so that only the matching lookup services reply to such a query. Lookup services also advertise the groups they belong to through the Multicast Announcement Protocol.

Implementations

There is a reference implementation of the Jini specification made by Sun Microsystems. It is known as the Starterkit11 and is distributed under the Apache Licence12, version 2.0. It fully implements the specification. The project was offered to the Apache Software Foundation Incubator and is now developed under a new name, River13. The second Apache project related to the Jini technology is Rio14.
It is one ‘level up’: based on Jini, it provides an architecture for developing, deploying and managing distributed systems. Rio provides QoS-based management for distributed systems, with a policy-based approach to fault detection and recovery, scalability and dynamic deployment.

Summary

JINI provides a good starting point for implementing distributed systems. Implementations like Rio help to define a complex system by breaking it down into separate services.

11 JINI Starterkit – https://starterkit.dev.java.net
12 http://www.apache.org/licenses/LICENSE-2.0
13 Apache River Home Page – http://incubator.apache.org/river/RIVER/index.html
14 Apache Rio Home Page – http://www.rio-project.org/