
Akademia Górniczo-Hutnicza im. Stanisława Staszica w Krakowie
Wydział Elektrotechniki, Automatyki, Informatyki i Elektroniki
Katedra Informatyki

Monitorowanie gridowych aplikacji naukowych sterowanych przepływem pracy

Bartosz Baliś

Rozprawa doktorska

Promotor: Prof. dr hab. inż. Jacek Kitowski

Kraków, listopad 2008


AGH University of Science and Technology
Faculty of Electrical Engineering, Automatics, Computer Science and Electronics
Department of Computer Science

Monitoring of Grid Scientific Workflows

Bartosz Baliś

PhD Dissertation

Supervisor: Prof. Jacek Kitowski

Kraków, November 2008


To Magda.


Acknowledgements

I would like to thank my supervisor, Prof. Jacek Kitowski, for his support and valuable remarks, which helped me very much to improve this dissertation. My special gratitude goes to Dr. Marian Bubak, my scientific advisor, from whom I learned a great deal. Without his continuous guidance and support this work would not have been possible. I would like to acknowledge the invaluable help of Prof. Roland Wismüller, collaboration with whom taught me a lot and helped me develop my research interests. Warm thanks go to my colleagues from the Department of Computer Science at AGH: Włodek Funika, Kasia Rycerz, Maciek Malawski and Renata Słota. Thank you for the inspiring discussions and the friendly atmosphere in my daily work. I owe many thanks to my colleagues from CYFRONET: Marcin Radecki and Tomek Szepieniec, with whom I had the pleasure to collaborate for several years. The help of other people who contributed to this work is also gratefully acknowledged: let me mention Kuba Dziwisz, Kuba Rozkwitalski, Kuba Wach, Michał Pelczar, Bartek Kowalewski, Bartek Łabno, Krzysiek Guzy, and Wojtek Rząsa. This research was partially supported by the EU-funded projects CrossGrid, K-Wf Grid and ViroLab, and by the Foundation for Polish Science.


Abstract

Scientific workflows are increasingly used to represent and manage scientific computations in modern environments for e-Science. Application monitoring is important in many scenarios. However, while until recently scientific applications were mainly homogeneous, parallel, tightly coupled, and deployed on clusters, scientific workflows are better characterized as heterogeneous, distributed and loosely coupled. Moreover, a modern cyberinfrastructure for e-Science typically exploits Grid technologies to access the underlying resources. Consequently, monitoring of Grid scientific workflows poses new, specific challenges. Those challenges also arise from the exceptional diversity of scenarios in which monitoring of Grid scientific workflows is important. The main contribution of this dissertation is the identification and analysis of key challenges in monitoring of Grid scientific workflows, the elaboration of solutions to those challenges, and the validation of those solutions. Four areas where key issues arise are recognized: building a monitoring infrastructure for Grid scientific workflows, solving the problem of on-line monitoring support within this infrastructure, monitoring of workflow legacy backends, and development of an information model for recording workflow executions. The results of the research are validated using diverse methodologies. Prototypes of the designed software components are built and used for monitoring of real-life workflows – a Coordinated Traffic Management application deployed within the EU-IST K-Wf Grid Project infrastructure, and a Drug Resistance application deployed in the virtual laboratory of the EU-IST ViroLab Project. A model-based performance analysis using two different approaches – Queuing Networks and discrete-event simulation – is carried out in order to validate the performance characteristics of the proposed solutions.


Contents

1 Introduction
  1.1 Research Motivation
  1.2 Research Goals
  1.3 Dissertation Statement
  1.4 Scientific Contribution
  1.5 Research Methodology
  1.6 Dissertation Organization

2 Background
  2.1 Introduction
  2.2 E-Science & Grid Computing
  2.3 Workflows
  2.4 Monitoring
  2.5 Semantic Web & Semantic Grid

3 Research Roadmap

4 Related Work
  4.1 Introduction
  4.2 Monitoring of Grid Applications
  4.3 Models for Recording Application Executions
  4.4 Summary

5 Monitoring Infrastructure for Grid Scientific Workflows
  5.1 Introduction
  5.2 Terminology
  5.3 Key design choices for a Grid workflow monitoring infrastructure
  5.4 Grid workflow monitoring architecture
  5.5 Representation of monitoring data
  5.6 Sensors & Sensor Management
  5.7 Subscription management
  5.8 Advertisement and discovery
  5.9 Instrumentation
  5.10 Notes on the prototype implementation of GEMINI
  5.11 Summary

6 Support for On-line Collection of Workflow Monitoring Data
  6.1 Introduction
  6.2 Resource discovery for Grid workflows
  6.3 Alternatives for a monitoring architecture with on-line monitoring support
  6.4 Discussion
  6.5 Coordinator – prototype implementation
  6.6 Summary

7 Monitoring of Workflow Legacy Backends
  7.1 Introduction
  7.2 The OMIS interface
  7.3 Extensions to OMIS towards Grid support
  7.4 The OCM-G monitoring system
  7.5 Selective fine-grained instrumentation
  7.6 Integration of the OCM-G with GEMINI
  7.7 Summary

8 From Monitoring Data to Experiment Information
  8.1 Introduction
  8.2 Requirements for an Information Model of Workflow Execution Records
  8.3 Information sources and sinks
  8.4 Model of Experiment Information
  8.5 Aggregation of Monitoring Data into Experiment Information
  8.6 Querying capabilities over Experiment Information Records
  8.7 Summary

9 Case Study: Instrumentation and Tracing of Scientific Workflows
  9.1 K-Wf Grid infrastructure
  9.2 Extensions to Monitoring Events Model
  9.3 Monitoring of Coordinated Traffic Management Workflow
  9.4 Legacy code support example
  9.5 Summary

10 Case Study: Obtaining and Using Experiment Information Records
  10.1 Introduction
  10.2 The ViroLab virtual laboratory
  10.3 Monitoring and Recording the DRS application
  10.4 Querying over DRS execution records
  10.5 Summary

11 Performance Evaluation
  11.1 Introduction
  11.2 Queuing networks
  11.3 Discrete event simulation
  11.4 Models parameters
  11.5 Queuing Network models
  11.6 Simulation models
  11.7 Results
  11.8 Summary

12 Conclusion and Future Work
  12.1 Summary and Conclusion
  12.2 Future Work


List of Figures

2.1 Taxonomy of monitoring
2.2 Ontologies as inter-lingua
3.1 Monitoring of Grid Scientific Workflows – the big picture
5.1 The simplest application monitoring scenario
5.2 Application monitoring with a separate monitoring system
5.3 Decentralized monitoring architecture with a discovery service
5.4 Local monitors – sensors and mutators – in the monitoring architecture
5.5 Grid workflow monitoring architecture
5.6 Logical architecture of the monitoring infrastructure
5.7 Conceptualization of application monitoring data
5.8 Base taxonomy of workflow monitoring events
5.9 Monitoring data dissemination modes: query and subscribe
5.10 The subscription scenario
5.11 The query scenario
5.12 Sample code in C and its SIR
5.13 The instrumentation scenario
6.1 Grid workflow monitoring architecture with Coordinator
6.2 Grid workflow monitoring architectures with DHT network
6.3 Distributed collection algorithm with a Subscription Coordinator
6.4 Distributed collection algorithm with peer-to-peer distributed hash table
6.5 Automatic resource discovery with migration of a workflow activity
7.1 OCM-G components distributed in the Grid
7.2 Start-up of the OCM-G
7.3 OMIS request distribution in the OCM-G
7.4 Generation of SIR for an application monitored by the OCM-G
7.5 OCM-G integrated with the GEMINI infrastructure
8.1 Experiment information model associated with domain ontologies
8.2 Mapping between monitoring data and experiment information
8.3 Record of a Drug Ranking computation with different semantics variants
9.1 Architecture of the K-Wf Grid infrastructure
9.2 Monitoring events in K-Wf Grid
9.3 Coordinated Traffic Management workflow
9.4 Monitoring of CTM workflow – global view
9.5 Monitoring of CTM workflow – detailed local view
9.6 Instrumentation: SIRWF (simplified)
9.7 Instrumentation: WIRL (simplified)
9.8 Legacy monitoring: workflow activity level
9.9 Legacy monitoring: legacy backend level
10.1 Architecture of the ViroLab virtual laboratory
10.2 Workflow execution event model extended with DRS domain events
10.3 Recording experiment execution in ViroLab
10.4 Query construction in QUaTRO GUI
10.5 Query evaluation
11.1 The simplest Queuing Network model with a single service center
11.2 Different classes of resources
11.3 QN model for on-line monitoring with Coordinator
11.4 QN model for on-line monitoring with a distributed hash table
11.5 Coordinator model: QN vs. simulation – utilizations
11.6 Coordinator model: QN vs. simulation – queue lengths
11.7 Coordinator model: QN vs. simulation – response times
11.8 DHT model: QN vs. simulation – utilizations
11.9 DHT model: QN vs. simulation – queue lengths
11.10 DHT model: QN vs. simulation – response times
11.11 QN vs. simulation – system response times for (a) coordinator model, (b) DHT model
11.12 Simulation results: average system response time – Coordinator vs. DHT (CSIM)

List of Tables

3.1 Key Grid workflow monitoring challenges and their solutions
5.1 Basic terms related to a monitoring infrastructure
5.2 Basic workflow execution monitoring events
7.1 Comparison of OMIS/OCM-G and GEMINI approach to monitoring
11.1 Fundamental QN laws
11.2 Formulas for open QN model with multiple classes
11.3 Service demand matrix – coordinator model
11.4 Service demand matrix – DHT model
11.5 Service times for facilities used in coordinator and DHT models


Chapter 1

Introduction

This chapter presents the motivation behind the research carried out within this Dissertation. The research goals, dissertation statement, scientific contribution, and research methodology are subsequently described, and the basic concepts surrounding the research are briefly introduced.

1.1 Research Motivation

Workflows have been increasingly used in recent years to represent and manage scientific computations in the form of workflow-based scientific applications, or simply 'scientific workflows' [171] [71]. Scientific workflows are used by scientists to conduct in silico experiments. The benefits of using workflows for this purpose include the automation of a scientific data analysis pipeline, ease of use important for non-IT experts, increased reusability, and enhanced capabilities for experiment reuse and for sharing experience in a scientific community [120] [118]. While until recently scientific computations were performed mainly via tightly coupled, homogeneous, parallel applications running on clusters, scientific workflows are better characterized as loosely coupled, heterogeneous, distributed applications. Moreover, today's computer-aided science, so-called e-Science, increasingly uses large-scale collaborative environments which require a complex underlying cyberinfrastructure. Many environments for e-Science adopt Grid infrastructures to manage and provide access to the underlying resources [65]. Monitoring of scientific workflows is important in a wide variety of scenarios; to name some typical ones:

1. On-line performance analysis. A user is observing the performance of a long-running workflow application using an on-line performance analysis tool, and refines performance measurements on-the-fly, based on the currently observed progress of the application.

2. Dynamic reconfiguration.
A dynamic scheduling service detects an increased resource demand of a workflow task and decides to split the task in order to allocate more computational resources.

3. On-line steering. A user is watching the progress of a long-running workflow-based simulation and decides to adjust the simulation parameters on-the-fly, based on the results obtained so far.

4. Off-line performance analysis. An application developer is examining a workflow execution trace in order to find performance bottlenecks.

5. Performance optimization. A scheduler is looking for information concerning invocations of specific services, gathered from multiple workflow runs, in order to optimize the utilization of computational resources for the next run.

6. Experiment mining. A scientist is searching for past experiments executed as workflows which satisfy specific characteristics, in order to compare his results to those obtained by other scientists.

7. Provenance tracking. A scientist is retrieving provenance information concerning an experiment result obtained by a workflow run, in order to learn what process was used to produce this result and from what input data it was derived.

8. Experiment repetition. A scientist is using a recorded execution of a workflow in order to repeat an experiment, perhaps slightly changing its conditions.

The following facts can be observed in these examples. First, some scenarios (1-3) require on-line monitoring of a workflow; for others (4), off-line analysis of a workflow trace is sufficient; the remaining scenarios (5-8) require historic, queryable records of workflow runs. Second, the consumers of workflow monitoring data are both humans and processes. Third, in some scenarios (at least 4 and 5) a form of knowledge extraction is performed, i.e. derivation of useful, actionable information from the records of monitoring data.
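The two dissemination styles implied by these scenarios – on-line delivery of events while the workflow runs, and queries over archived records long after it finishes – can be sketched with a toy event channel. The class and attribute names below are the author's illustration, not any interface defined in this dissertation:

```python
from dataclasses import dataclass

@dataclass
class MonitoringEvent:
    workflow_id: str
    activity: str
    kind: str        # e.g. "activity_started", "activity_finished"
    timestamp: float

class EventChannel:
    """Toy channel combining both dissemination needs: on-line
    subscription (push) and queries over archived events (pull)."""

    def __init__(self):
        self._subscribers = []   # (predicate, callback) pairs
        self._archive = []       # historic, queryable record of all events

    def subscribe(self, predicate, callback):
        self._subscribers.append((predicate, callback))

    def publish(self, event):
        self._archive.append(event)                  # keep the historic record
        for predicate, callback in self._subscribers:
            if predicate(event):
                callback(event)                      # deliver while running

    def query(self, predicate):
        return [e for e in self._archive if predicate(e)]
```

An on-line performance tool (scenario 1) would register a callback via `subscribe`, while an experiment-mining service (scenario 6) would issue `query` calls over the archive after the run has completed.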
Challenges in monitoring of Grid scientific workflows

The above-outlined specific characteristics of Grid scientific workflows, and the diversity of scenarios in which their monitoring is important, entail a number of new challenges. While general problems of monitoring large-scale parallel and distributed Grid applications, such as dealing with high data rates, have been studied previously, no approaches known to the author address the challenges specific to monitoring of Grid scientific workflows. Those challenges arise from all three aspects: the "Grid" aspect, the "scientific" aspect, and the "workflow" aspect.

The "Grid" aspect

A Grid infrastructure provides access to a large number of highly distributed resources. Execution of applications in such an environment is handled by multiple middleware services, such as resource brokers, (meta-)schedulers, storage brokers, information systems, execution managers, or job queue managers. This greatly complicates the workflow's execution cycle and has important consequences for a monitoring infrastructure for Grid workflows. For example, a decentralized architecture is mandatory for scalability reasons.

In addition, a Semantic Grid has been envisioned as a future infrastructure for e-Science [55]. The authors distinguish three conceptual layers in such an infrastructure: the data/computation layer, the information layer, and the knowledge layer. In this distinction, data is understood as a (typed) sequence of bits (e.g. an integer), information is data equipped with a meaning (e.g. a temperature), while knowledge

is described as "information applied to achieve a goal, solve a problem or enact a decision", often defined as "actionable information". The process of knowledge acquisition may therefore involve search, aggregation or inference over existing information, in order to make the information usable for a particular goal.

In the context of monitoring of Grid scientific workflows, especially given the importance of historic workflow execution records, monitoring data should be distinguished from monitoring information. The latter is:

• well-structured: as opposed to monitoring data, which is essentially a collection of events, monitoring information should describe entities, their attributes, and the relationships between them;

• rich with semantics: information about the semantics of entities (computations, data, semantic relationships) should be captured; the semantics can be expressed by means of ontologies.

Existing workflow monitoring systems mostly focus on performance monitoring and, consequently, collect merely a trace of monitoring events which is used for post-mortem performance analysis. On the other hand, existing information models concentrate on the status of monitored resources, not on the execution of applications. This gap needs to be filled with a model of experiment information – monitoring information describing execution records of Grid (scientific) workflows.

The "Scientific" aspect

A scientific workflow is a particular type of workflow which is typically (though not necessarily) computationally- and/or data-intensive, and is used by a scientist to conduct in silico experiments which lead to one or more results valuable in the research carried out by the scientist. There are at least two important challenges that stem from the 'scientific' aspect of workflows:

• The core computations of scientific workflow activities are often done via legacy applications running in the backends, MPI parallel simulations being a typical example.
The monitoring infrastructure will have to deal with monitoring of those legacy applications.

• Scientific workflows are special in that their end-users – the researchers – are direct consumers of the workflow execution records. They use those records, for example, to retrieve the provenance of an experiment result, to repeat an experiment, or to extract knowledge by mining numerous records of previous experiments. Consequently, the experiment information model needs to take this specific end-user perspective into account: the record of a workflow execution also needs to be a record of an in silico experiment, much as a scientist would document an experiment in his or her lab book.

The "Workflow" aspect

A workflow-based application implies loose coupling, distribution, and heterogeneity with respect to programming languages and platforms (a direct consequence of loose coupling). Heterogeneity is further emphasized by the presence of the Grid.

• Loose coupling combined with the characteristics of the Grid makes workflow execution highly unpredictable. Specifically, it is hard to predict where and when the individual workflow activities will run. Consequently, a resource discovery mechanism is required to automatically and quickly discover new workflow activities. Otherwise, on-line monitoring, essential in certain scenarios, will not be possible.

• The multi-aspect heterogeneity of workflows implies that many diverse technologies and concepts for monitoring and instrumentation will have to be involved in monitoring a single workflow. On the other hand, the monitoring infrastructure should provide a unified view of the workflow. Consequently, the design of the monitoring infrastructure will have to conceal the inherent heterogeneity of a workflow-based application.

1.2 Research Goals

The overall goal of this work is to address the above-mentioned specific challenges related to monitoring of Grid scientific workflows. This goal can be broken down into several detailed goals, which can be summarized as follows:

• To provide a monitoring infrastructure for Grid workflows which supports monitoring and instrumentation services for heterogeneous and distributed workflows, in a uniform, transparent way, and at different levels of granularity. In other words:

  • the monitoring infrastructure should conceal the underlying heterogeneity of a workflow behind standard interfaces, protocols, and data representations;

  • in addition, the monitoring infrastructure should provide a framework for easy accommodation of different monitoring and instrumentation tools;

  • it should be possible to dynamically adjust the level of granularity at which monitoring is applied – from coarse grain (workflow level) to fine grain (code region level).
• To support monitoring of legacy applications, often running as computational backends of scientific workflows, within a monitoring infrastructure, in the way described above.

• To ensure on-line and scalable monitoring of Grid workflows. On-line means that the progress of the application's execution can be observed during its runtime. Scalable means the capability to maintain on-line monitoring services for large workloads – at least as large as those currently found in large-scale production Grid infrastructures – in terms of the monitoring data generated by workflow execution.

• To develop a monitoring information model for representing records of workflow executions. The information model should support the major scenarios involving the history of past workflow executions, in particular those where end-users are the consumers of those records.

1.3 Dissertation Statement

The statement of this Dissertation is:

Monitoring of Grid scientific workflows requires a proper monitoring infrastructure and can lead to valuable actionable information, provided that the workflow execution records collected by monitoring are properly structured and enhanced with application-domain-specific semantics.

A 'proper' monitoring infrastructure is one designed to address the specific challenges related to monitoring of Grid scientific workflows. 'Valuable actionable information' should be understood as knowledge that a consumer – a program or a human – can extract from the information gathered in workflow execution records.

1.4 Scientific Contribution

The main contribution of this Dissertation to Computer Science is the identification and analysis of key challenges in monitoring of Grid scientific workflows, the elaboration of solutions to those challenges, and the validation of those solutions. This contribution can be broken down into the following accomplishments:

1. A comprehensive analysis of requirements for a monitoring infrastructure suitable for Grid scientific workflows has been carried out. Combined with a study of existing monitoring approaches, it led to an understanding of the gaps in currently available monitoring solutions and, consequently, to a design of a monitoring infrastructure for Grid workflows. This design comprises the following detailed achievements:

• The design of the infrastructure as a monitoring framework in order to conceal the underlying heterogeneity of workflows.

• The elaboration of a model for workflow monitoring events as an extension to the Global Grid Forum's Discovery and Monitoring Event Description (DAMED) definition of Grid monitoring events.

• The proposal of Complex Event Processing technologies to handle subscriptions in the monitoring infrastructure.
• The introduction of a standardized Instrumentation Service as part of the monitoring infrastructure, by adoption of existing specifications: the Workflow Instrumentation Request Language (WIRL) and the Standardized Intermediate Representation (SIR).

2. The problem of automatic resource discovery has been identified as a prerequisite for on-line monitoring of Grid workflows, and solved by proposing a Distributed Hash Table infrastructure to be federated with the monitoring infrastructure. The proposed solution has been evaluated for scalability and compared with a centralized solution.

3. Monitoring and instrumentation of legacy applications has been addressed in a portable, Grid-enabled way: (1) by adopting and extending the existing OMIS approach, (2) by development of the OCM-G, an OMIS-compliant Grid-enabled monitoring system for parallel applications, and (3) by integration of the OCM-G with the overall workflow monitoring infrastructure.

4. An information model to describe records of workflow executions has been proposed. Semantic Web technologies have been employed to describe this model and to represent the actual records, in order to enhance the records with application-specific semantics, such as that of the scientific domain the workflow concerns. It has been proven that a well-defined information layer describing

workflow executions enables end-users to extract actionable information from workflow execution records.

The following communities could benefit from the scientific achievements of this work:

• Designers of Grid middleware services can benefit in several ways. First, the interoperability of existing monitoring services can be enhanced by adopting some of the proposed solutions as standards, notably the taxonomy of workflow monitoring events and the monitoring information model for Grid scientific workflows. Second, the performance and scalability of Grid information services could be improved thanks to the research on using DHT infrastructures to enhance automatic resource discovery.
• The Semantic Grid community can benefit from the proposed information model for Grid workflow execution records, in order to enhance the interoperability of Grid services and facilitate semantic information integration.
• The designers of e-Science services can use the research on the Experiment Information model to enhance provenance services.
• The community of domain researchers can benefit from the Experiment Information model to facilitate knowledge extraction from records of previous experiments, and to enhance experiment reuse.

1.5 Research Methodology

In order to prove the statement of this Dissertation, various approaches are applied.

• A prototype implementation of the proposed workflow monitoring architecture – the GEMINI system – has been created in order to demonstrate monitoring of real-life workflows.¹
• The proposed solution for enabling on-line monitoring of Grid workflows has been evaluated with a model-based approach. An alternative centralized solution was also evaluated for comparison. Two methodologies – queuing networks and discrete event simulation – have been employed to build models of the proposed alternative monitoring system architectures and to evaluate their performance characteristics.
The parameters for both models have been captured from the implemented software prototypes, from available information about production Grid infrastructures, and from performance evaluations published in scientific papers.
• A new system for monitoring of parallel legacy applications – the OCM-G – has been conceived and created based on the core of an existing monitoring system, the OCM, and integrated with GEMINI to demonstrate performance monitoring of a legacy MPI application invoked from a workflow activity, including selective, fine-grained, runtime instrumentation.²

¹ The implementation of GEMINI was created under the supervision of the author by Jakub Dziwisz, Kuba Rozkwitalski, Bartlomiej Labno and Bartosz Kowalewski [175, 14, 13, 27, 176, 20, 21].
² The implementation of the OCM-G system was performed by the author with help from Marcin Radecki, Tomasz Szepieniec, Wojciech Rzasa and Krzysztof Guzy [25, 15, 17, 23, 18, 24, 16, 19, 27].

• Query Translation Tools (QUaTRO), also created under the supervision of the author of this Dissertation, have been used for ontology-based visual querying over workflow execution records and experiment data. The use of those tools proved that the proposed ontology-based model enables complex end-user-oriented querying and semantic information integration between experiment data and workflow execution metadata.³
• The aforementioned systems and tools – GEMINI, the OCM-G, and QUaTRO – have been deployed in the testbeds of two EU-funded projects: K-Wf Grid and ViroLab. Workflow applications from those projects – a Coordinated Traffic Management (CTM) application (K-Wf Grid) and a Drug Resistance application (ViroLab) – have been monitored to collect real-life monitoring data and to build experiment information records.

1.6 Dissertation Organization

This Dissertation is organized in the following way. Chapter 2 presents the conceptual background of this Dissertation, covering such topics as e-Science, Grid Computing, Scientific Workflows, Monitoring, and the Semantic Grid. Chapter 3 outlines a research roadmap which is the topic of the next chapters. Chapter 4 discusses the related work. In Chapter 5, a monitoring infrastructure for Grid scientific workflows is proposed, based on a requirements analysis derived from the specific characteristics of Grid scientific workflows. In Chapter 6, a solution for on-line monitoring of Grid workflows, based on automatic resource discovery enabled by a Distributed Hash Table infrastructure, is proposed. Chapter 7 describes the problem of monitoring of legacy applications in a Grid, which are frequently used as computational backends of scientific workflows. Chapter 8 introduces the concept of Experiment Information, an information model to represent historic executions of workflows. An ontology model of experiment execution is proposed.
The problem of aggregating low-level monitoring data into high-level ontology individuals constituting experiment information is described. Chapter 9 presents examples of monitoring of scientific workflows to demonstrate basic functionalities of the Grid workflow monitoring infrastructure. The first example concentrates on a complex workflow implementing a Coordinated Traffic Management application, composed of multiple distributed activities. The second example involves monitoring of a legacy code running in the backend of a workflow task. Chapter 10 demonstrates how execution records from multiple workflow runs are built and used. A medical Drug Resistance application is used to show how the collected records can provide the provenance of results of scientific experiments. Achievements enabled by the Experiment Information model – ontology-based visual querying and semantic information integration – are also demonstrated. Chapter 11 contains a model-based performance evaluation of the proposed solution for on-line monitoring based on a DHT infrastructure. For comparison, an alternative centralized solution is also evaluated. Chapter 12 concludes the Dissertation and discusses future work.

³ The implementation of QUaTRO has been done by Michal Pelczar and Jakub Wach, with contributions from student projects, under the supervision of the author [26, 22].


Chapter 2. Background

2.1 Introduction

In many areas of scientific research today, a computer is not just a useful gadget, but an essential piece of equipment which enables new scientific discoveries. The vision of a next-generation cyberinfrastructure for e-Science is one where a scientist can easily design an experiment, which typically would involve scientific data retrieved from distributed repositories, integrated if necessary, and transformed in various ways, in order to deliver a desired result [170]. The cyberinfrastructure would – based on a high-level experiment description – automatically find the required data sources, identify and perform the necessary information integration, discover and run proper services to perform the necessary data transformations, and return the final result. Several elements play an important role in this vision:

• Scientific workflows provide a way for a scientist to describe in silico experiments which are executed as applications on the resources provided by the underlying infrastructure.
• Grid technologies provide a unified access layer to the underlying computing and data resources.
• Semantic Web and Semantic Grid technologies enhance services and data with semantics, indispensable for achieving integration of heterogeneous scientific data and interoperability of services.

Application monitoring applied to scientific workflows must take into account the broader context in which scientific workflows are used. The remainder of this chapter introduces the essential concepts that play an important role in the big picture of monitoring of scientific workflows: e-Science, Grid Computing, Workflows, Monitoring, Semantic Web and Semantic Grid.

2.2 E-Science & Grid Computing

The term e-Science [88] was coined to denote a new type of scientific research based on collaboration within a number of scientific areas, enabled by a next-generation infrastructure wherein people,

(28) 10 Chapter 2. Background computing resources, data and instruments are brought together to enable breakthrough discoveries and to bring a new quality to the everyday work of researchers1 . The infrastructure in question is usually identified with Grid systems which offer at least two benefits important for large-scale, loosely-coupled, cross-institution research and collaboration: (1) virtualization and sharing of resources [134], and (2) creation of virtual organizations [66]. Until now, there is no single widely agreed-upon definition of the Grid. Rather, there are many attempts to define it, some important ones being as follows: 1. Foster et al. [66] define the Grid as “flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources”. 2. In a later article, Foster proposes a three point checklist according to which a Grid system is one that (i) “Coordinates resources that are not subject to centralized control”, (ii) uses “standard, open, general-purpose protocols and interfaces”, (iii) in order to “deliver nontrivial qualities of service” [63]. 3. According to Plaszczak and Wellner [139] the Grid is “the technology that enables resource virtualization, on-demand provisioning, and service (resource) sharing between organizations.” (p. 59). First Grid-based e-Science projects whose aim was to build production Grid infrastructures for scientists started in early 2000s, examples being the Enabling Grids for E-SciencE (EGEE) Project2 which focuses on high energy physics and life sciences, and the UK e-Science Programme, covering multiple areas including biological sciences, medical sciences and particle physics [88]. At present, there are many production scientific Grid infrastructures – national, regional and world-wide. Some examples of those include: • The EGEE infrastructure, covering 250 sites in 50 countries [69]. 
• The Open Science Grid, whose goal is to provide access to Grid resources to diverse communities of scientists [141] [142].
• The TeraGrid infrastructure, which integrates high-performance computers, data resources including discipline-specific databases, and tools, using high-performance network connections [48].
• The Distributed European Infrastructure for Supercomputing Applications (DEISA), an initiative to support computational sciences in Europe [112]. The goal of DEISA is to integrate national High Performance Computing infrastructures using modern Grid technologies.
• D-Grid, a German national project whose goal is to develop a platform and infrastructure for high performance computing.³
• The Worldwide LHC Computing Grid (WLCG)⁴, which brings together computing resources devoted to the Large Hadron Collider (LHC) experiments located at CERN. This initiative was, in fact, the initial

¹ John Taylor's definition: "e-Science is about global collaboration in key areas of science and the next generation of infrastructure that will enable it." See http://www.e-science.clrc.ac.uk.
² http://www.eu-egee.org
³ D-Grid homepage, www.d-grid.de
⁴ lcg.web.cern.ch/LCG

driving force for world-wide computing infrastructures. The resources contributed to the WLCG come from numerous Grid infrastructures, including EGEE, OSG and NorduGrid, as well as GridPP in the UK and INFN in Italy.

2.3 Workflows

A workflow can be defined as simply as "a partially ordered set of jobs" [1]. The Workflow Management Coalition⁵ defines a workflow as "the automation of a business process, in whole or parts, where documents, information or tasks are passed from one participant to another to be processed, according to a set of procedural rules" [188]. According to van der Aalst [2], the emergence of workflows is a result of three trends in the evolution of information systems. The first trend is that gradually more and more functionality that used to be tailor-made for particular applications is now part of a generic, application-independent layer, an example being generic database management systems. The consequence of this trend for system development is a shift from programming from scratch to assembling existing complex software systems. The second trend is a shift from data-driven to process-driven approaches. While twenty years ago the development of applications was centered around data modeling, the modeling of business processes was neglected. Since then, the focus on process modeling has been ever-increasing. The last trend is the increasing need for dynamic change and tailoring of existing information systems. Consequently, software must be open to frequent redesign and quick adaptation to changing requirements. As a result of those trends, generic workflow management systems emerged. In workflows managed by generic workflow management systems, core business logic is clearly separated from process control logic ([107, pp. 79-80; pp. 112-114]; [114, p. 13]). Consequently, the two can vary independently. The process model of the application can be changed with greater flexibility and agility.
At an abstract level, workflows can be viewed as graphs whose nodes denote workflow tasks (also known as activities or operations) to be done, while edges represent dependencies between those tasks. Those dependencies are either control-driven or data-driven [160]. A typical control dependency specifies the required sequence of tasks, e.g. that task a must complete before task b. A data dependency, on the other hand, specifies e.g. a data file which must be created by task a before task b can proceed. Particular workflow languages are often hybrid, containing both data and control dependency constructs. A workflow-based application is one built according to the workflow paradigm. The importance of workflow-based applications in the Grid, as a means of enabling users to compose complex Grid applications, has been pointed out [135]. The recently emerged scientific workflows are workflows used by scientists to "glue" together scientific data and services, in order to facilitate the conduct of in silico experiments [120]. Scientific workflows are also viewed as a means of scientific information integration through analytical data processing [118]. They give a domain scientist and a workflow developer a "simple, high-level model for thinking about their analysis pipelines" (Ibid.). In addition, as De Roure and Frey [151] point out, "capturing the experiment as a re-usable digital artifact (...) facilitates sharing and re-use of experiment design". Scientific workflows differ from business workflows in the following ways [30] [120]:

• Scientific workflows are typically dataflow-oriented, data-centric "analysis pipelines", whereas business workflows are task-centric and control-flow-oriented.

⁵ Workflow Management Coalition, http://www.wfmc.org

• Scientific workflows are metadata-intensive, since metadata is particularly important for scientists (e.g. provenance) [74].
• Since scientists are individualistic in their research, scientific workflows are more often ad hoc, whereas business workflows are typically used for the automation of standard business procedures and are characterized by a high repetition rate ([114], pp. 10-11). Nevertheless, as pointed out earlier, reuse is also important in scientific workflows.
• Scientific workflows are often resource-intensive, both in terms of computation and data.

There are a number of popular workflow systems for e-Science, for example myGrid / Taverna [144, 136], Kepler [119], Triana [172], Pegasus [57], ICENI [127], ASKALON [62], and MOTEUR / P-GRADE [72].

2.4 Monitoring

Monitoring has been present in computer science since the 1960s, and it first emerged in the context of software debugging [154]. There have been many definitions of monitoring, for example:

• Snodgrass [166]: "Monitoring is the extraction of dynamic information concerning a computational process, as that process executes."
• Joyce et al. [103]: "The monitoring of distributed systems involves the collection, interpretation, and display of information concerning the interactions among concurrently executing processes."
• Mansouri-Samani and Sloman [125] [124]: "Monitoring is defined as the process of dynamic collection, interpretation and presentation of information concerning objects or software processes under scrutiny."

With respect to the subject of monitoring, it is customary to distinguish application monitoring from infrastructure monitoring. The former concentrates mostly on end-user applications, the latter on the underlying infrastructure, which encompasses hardware resources (CPUs, memory, storage, network, etc.) as well as operating system and middleware services.
Monitoring can be applied on-line or off-line, which denotes the mode in which the collected monitoring data is used: simultaneously with the operation of the monitored entity (on-line) or afterwards, using a collected trace of the monitoring data (off-line). Whichever mode is chosen may have important consequences for the requirements imposed on the monitoring system's design [35]. Many purposes of monitoring have been identified, the most important ones being as follows (extended from Schroeder [154]):

• Debugging and testing. Extracts information, such as data values, from the application being monitored.
• Performance evaluation. Extracts data from a system in order to assess its performance.
• Performance enhancement. In this context, Schroeder includes dynamic system configuration, dynamic program tuning, and on-line steering. This list must be extended with those performance enhancement cases where historic records are used to improve future executions, e.g. performance prediction.

• Dependability. Includes detection of failures and performance degradations of software.
• Control. In this case, monitoring is an inherent part of the target system, necessary to provide its functionality.
• Correctness checking. Includes status checking, error detection, and checking the consistency of the application's behaviour with a (formal) specification. This aspect gains particular importance in the context of Grid and utility computing, where Quality of Service is guaranteed via performance contracts formalized in Service-Level Agreements (SLAs) [152]. Monitoring is essential in order to verify whether the SLAs are met, and to generate alerts or take corrective actions otherwise [183].
• Security. Monitoring is used to detect security violations.

Let us note that to support some of the above-mentioned purposes, off-line monitoring is sufficient. Others require, or are best realized with, on-line monitoring (e.g. some forms of performance enhancement, control, correctness checking, dependability). Moreover, in some cases historic records collected from previous monitoring sessions are required (e.g. in certain types of performance prediction). As far as the methods of monitoring are concerned, two broad groups are widely distinguished: clock-driven monitoring and event-driven monitoring [145]. Clock-driven monitoring, also known as sampling, consists in recording the state of the observed entity at periodic time intervals. Application profiling tools, such as gprof [77], are based on this method. Those tools periodically record the value of the instruction counter and compute, e.g., the time elapsed in the individual procedures of the target program in proportion to the number of instruction counter hits within the given procedure. Event-driven monitoring is triggered by the occurrences of events and can be divided into timing, counting and tracing [53]. In timing, the time elapsed in a given part of the program is measured.
For example, the time spent in a procedure call can be measured by subtracting clock values recorded before and after the call. In counting, the number of occurrences of a given event is measured. The advantages of counting over timing are lower intrusiveness and lower memory consumption. Tracing is the most general technique of event-driven monitoring, in which all events are recorded. The advantage of tracing is that it collects the most detailed information, from which not only timing and counting, but also other pieces of information can be subsequently derived. However, tracing can also be the most intrusive. The discussed aspects of monitoring are illustrated as a monitoring taxonomy in Fig. 2.1. This taxonomy is not meant to be exhaustive. Monitoring of Grid workflows can be described as a process that typically encompasses the following phases:

• Instrumentation. Inserting probes in order to enable generation of monitoring events during the runtime of the instrumented processes.
• Generation. Event detection and creation of monitoring events.
• Processing. Simple processing may occur before events are delivered to consumers, for example (extended from [125]):
  • merging – e.g. combining two traces into one;
  • aggregation – e.g. computing an average, maximum or minimum value;
  • counting – computing a number of occurrences;

  • timing – computing the duration of some activities;
  • filtering – discarding selected events according to certain criteria;
  • correlation – computing complex events;
  • translation – e.g. conversion between representations;
  • validation – e.g. checking the correctness of time stamps and event IDs;
  • database update – monitoring information may be used to refresh a database which stores an up-to-date status of the monitored system (e.g. Grid information systems).
• Discovery. Finding producers of monitoring data that match a subscription submitted by a consumer.
• Dissemination. Registration of subscriptions and delivery of monitoring events to subscribers.
• Recording. Storing a record of the execution in a persistent store.

[Figure 2.1: Taxonomy of monitoring – monitoring is classified by mode (on-line, off-line), subject (applications, infrastructure), purpose (debugging & testing, performance evaluation, performance enhancement – dynamic configuration, on-line steering, dynamic tuning – control, dependability, security, correctness checking), and method (event-driven: timing, counting, tracing; clock-driven).]

2.5 Semantic Web & Semantic Grid

The Grid is one of two major visions of the future of the Internet. The other one is the Semantic Web. The Semantic Web is described as an extension of the current World Wide Web in which "information is given well-defined meaning, better enabling computers and people to work in cooperation" [34]. While the Grid has been named a future infrastructure for e-Science, high performance computing enabled by the Grid is only one of the pillars of e-Science, the other one being information management. One of the reasons for this is a high need for interoperability in e-Science projects [86]: "(...) interoperability is key to all aspects of scale that characterize e-Science, such as scale of data, computation, and collaboration. For example, to predict the physical properties of a crystal, a chemist may wish to

correlate a new molecular structure with existing structural databases. We need interoperable information in order to query across the multiple, diverse data sets, and an interoperable infrastructure to make use of existing services for doing this."

Consequently, only the marriage of the two technologies – Grid computing and the Semantic Web – would enable the achievement of the full e-Science vision (Ibid.). This marriage has been named the Semantic Grid [55]. By analogy to the Semantic Web, De Roure and Goble describe the Semantic Grid as "an extension of the current Grid in which information and services are given well-defined meaning, better enabling computers and people to work in cooperation" [75]. The key factor enabling the Semantic Grid vision is Semantic Web technologies [73]. De Roure and Frey [151] describe a good example of the role Semantic Web technologies (RDF in this case) can play, based on their experience in the CombeChem Project: "(...) we found that the chemistry researchers were keen to import chemical information directly into the RDF stores. The benefits were the uniform description and the flexible schema afforded by this approach, contrasting the diversity of relational databases where changing schema was impossible or achievable only at very high cost."

The key Semantic Web technologies are as follows [159]:

1. Universal Resource Identifiers (URI) – provide a global naming convention for identifying resources on the Web.
2. eXtensible Markup Language (XML) – provides the basic syntax for Web documents [61].
3. Resource Description Framework (RDF) – a standard for expressing facts about Web resources as RDF triples. RDF is the most basic tool to express the semantics of Web resources [146].
4. Resource Description Framework Schema (RDFS) – a knowledge representation language which introduces basic elements that make it possible to structure RDF resources.
5.
Web Ontology Language (OWL) [138] – a knowledge representation language more expressive than RDFS. OWL is based on XML, for the syntactical structuring of documents, and RDF, which provides a data model to describe Web resources, their relationships and attributes, and a simple semantics for this model.

Ontologies are a standard, highly sharable, and machine-processable way to represent vocabularies and semantic relationships between entities of a given domain [73]. In the context of knowledge engineering (as opposed to Ontology in Philosophy), one of the briefest and most frequently cited definitions of ontologies is the following: An ontology is an explicit specification of a conceptualization ([79], p. 199; after [76], p. 6). Conceptualization of a domain using ontologies comprises a vocabulary, definitions of concepts, relationships between concepts, and also domain rules which impose restrictions on the semantics of concepts and the relationships between them [168]. Ontologies may also contain instances of the given domain. Uschold and Grüninger [182] distinguish three categories of uses of ontologies: (1) communication between people, (2) interoperability among systems, and (3) systems engineering: "specification, reliability and reusability".
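To make the RDF data model above concrete, the following minimal sketch represents facts as subject–predicate–object triples and evaluates a simple triple-pattern query over them. All URIs and property names here (hasStage, hasInput, hasOutput) are hypothetical illustrations, not the actual vocabulary proposed later in this Dissertation:

```python
# A tiny, self-contained sketch of the RDF triple model: each fact is a
# (subject, predicate, object) tuple; the URIs below are hypothetical.
EX = "http://example.org/experiment#"

triples = {
    (EX + "run42",     EX + "hasStage",   EX + "alignment"),
    (EX + "run42",     EX + "executedBy", EX + "scientistA"),
    (EX + "alignment", EX + "hasInput",   EX + "sequenceFile1"),
    (EX + "alignment", EX + "hasOutput",  EX + "alignedSet1"),
}

def objects(subject, predicate):
    """Return all objects matching the (subject, predicate, ?) pattern."""
    return {o for (s, p, o) in triples if s == subject and p == predicate}

# Which inputs did the 'alignment' stage consume?
print(objects(EX + "alignment", EX + "hasInput"))
```

Real RDF stores generalize exactly this idea: a graph of triples queried by patterns, with RDFS/OWL adding vocabulary for classes, properties and constraints on top of the same model.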

[Figure 2.2: Ontologies as inter-lingua: the need for translators between languages L1–L4 is reduced from O(n²) to O(n). Source: [182]]

Communication. Ontologies, by providing a shared and agreed conceptual framework, enable "understanding and communication between people with different needs and viewpoints arising from their particular contexts".

Interoperability. An ontology may be used as an inter-lingua to support translation between languages and representations. For example, supporting translation between every two of n parties taking part in an exchange would require O(n²) translators without an inter-lingua and O(n) translators with one (Fig. 2.2).

Systems engineering. Ontologies can also enhance the design and development of software systems. A shared problem understanding provided by ontologies can be used as a basis for system specification. Moreover, the reliability of software systems is improved – ontologies can be used for checking the software design with respect to its specification. Ontologies can also support the reusability of software modules.

Spyns et al. compare data modeling with ontology engineering [168]. The most important differences they point out are as follows:

• Genericity. Data models are task-specific and implementation-oriented, whereas ontologies should be as generic and task-independent as possible.
• Expressive power. Data models concentrate on the structure and integrity of data. Ontologies must also express domain conceptualization, hence ontology languages contain more constructs to express constraints, such as taxonomies.
• Extendibility. Data models reflect the particular universe of a specific application. In ontologies, on the other hand, the modeled subjects are separated from the problem, hence they must be extensible and ready for use in both anticipated and unanticipated manners.
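The O(n²)-to-O(n) reduction illustrated in Fig. 2.2 is easy to check numerically. The sketch below counts directed translators (one per ordered pair of parties, which is one possible counting convention) against an inter-lingua setup with one encoder and one decoder per party:

```python
# Translators needed so that each of n parties can send to every other:
# pairwise translation requires one translator per ordered pair, while
# an inter-lingua requires only two per party (to and from it).
def pairwise_translators(n):
    return n * (n - 1)   # O(n^2)

def interlingua_translators(n):
    return 2 * n         # O(n)

for n in (4, 10, 100):
    print(n, pairwise_translators(n), interlingua_translators(n))
```

For the four languages of Fig. 2.2 this gives 12 pairwise translators versus 8 with an inter-lingua; the gap widens quadratically as n grows.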

Chapter 3. Research Roadmap

The goal of this chapter is to present a synthetic overview of the key challenges in monitoring of Grid scientific workflows, in order to outline the scientific roadmap followed in the next chapters, wherein the individual problems are approached analytically. Let us look at the problem of monitoring of scientific workflows from a bird's eye view, through the 'big picture' depicted in Fig. 3.1. The execution of a workflow is orchestrated by a workflow enactment engine, according to a workflow plan. The workflow enactment engine cooperates with other services, such as a Scheduler, in order to map workflow activities onto available Computing Resources. Some workflow activities may invoke legacy applications (e.g. an MPI parallel application running on a local cluster) in order to perform their tasks. Two actors who execute a workflow are presented: a scientist and a workflow developer. They look at the workflow from different perspectives – the former uses workflows to perform in silico experiments, while the latter usually views workflows as applications that need debugging, testing or performance analysis. A monitoring infrastructure is responsible for the instrumentation of resources in order to generate monitoring events, the discovery of event sources, the collection and correlation of those events, and their dissemination to subscribers. Monitoring data is collected from various sources – not only the workflow enactment engine, but also workflow activities, middleware services, resources and invoked legacy applications. The execution of the workflow is recorded according to a structured monitoring information model, called the Experiment Information model, and stored in a persistent Experiment Information Store. A workflow execution record compliant with the Experiment Information model is created from the low-level monitoring events aggregated by an Information Aggregator.
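To illustrate the kind of aggregation such an Information Aggregator performs, the sketch below folds a stream of low-level activity events into per-activity summary records. The event fields, names and lifecycle states are hypothetical stand-ins, not the actual event model defined later in this Dissertation:

```python
# Hedged sketch: aggregate low-level monitoring events (one dict per
# event) into per-activity records with start time, end time and status.
events = [
    {"activity": "ctm-simulation", "type": "started",  "timestamp": 10.0},
    {"activity": "ctm-simulation", "type": "finished", "timestamp": 42.5},
    {"activity": "data-transfer",  "type": "started",  "timestamp": 11.2},
]

def aggregate(event_stream):
    records = {}
    for ev in event_stream:
        rec = records.setdefault(
            ev["activity"],
            {"start": None, "end": None, "status": "unknown"})
        if ev["type"] == "started":
            rec["start"], rec["status"] = ev["timestamp"], "running"
        elif ev["type"] == "finished":
            rec["end"], rec["status"] = ev["timestamp"], "completed"
    return records

recs = aggregate(events)
print(recs["ctm-simulation"])  # a completed activity with start and end times
```

The real aggregation additionally maps such summaries onto ontology individuals of the Experiment Information model, which is the subject of Chapter 8.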
The picture presents a few consumers of monitoring data and information:

• A Performance Monitoring Tool subscribes to the monitoring system in order to obtain workflow execution events, measure the performance of the workflow, and present the measurements to the workflow developer.
• The Scientist retrieves information about previous experiments from the Experiment Information Store, including the provenance of experiment results.

[Figure 3.1: Monitoring of Grid Scientific Workflows – the big picture. The figure shows the Workflow Enactment Engine, Scheduler, Monitoring System (instrumentation, data collection, data representation, data discovery), local and legacy monitors on Computing Resources running workflow activities and legacy applications, the Information Aggregator, the Experiment Information Store and model, and the actors: Scientist, workflow developer and Performance Monitoring Tool.]

• The Scheduler benefits from the recorded information about previous workflow runs in order to improve the execution of subsequent runs.

This 'big picture' vision, as well as the specific properties of Grid scientific workflows outlined in Section 1.1, lead to the following key challenges in monitoring of Grid scientific workflows:

1. Development of a monitoring infrastructure capable of concealing workflow heterogeneity in a common monitoring framework.
2. Support for on-line collection of Grid workflow monitoring data.
3. Monitoring of workflow legacy backends within the monitoring infrastructure.

Challenge: Development of a monitoring infrastructure for Grid scientific workflows.
Solution: Design of the monitoring infrastructure as a framework which supports pluggable data sources and instrumentation tools; adoption of technology- and platform-independent interfaces for monitoring and instrumentation; design of a standardized hierarchy of workflow monitoring events; adoption of the Standardized Intermediate Representation in order to support fine-grained, language- and platform-independent instrumentation of workflows.

Challenge: Support for on-line monitoring of Grid workflows.
Solution: Automatic resource discovery handled by the monitoring infrastructure to support fast discovery of new workflow activities; a DHT-based solution to enable the automatic resource discovery; model-based performance analysis of the solution in order to verify its scalability; evaluation of an alternative centralized solution for comparison.

Challenge: Monitoring of workflow legacy backends.
Solution: Adoption of the OMIS approach to monitoring parallel applications in a Grid; design and development of the OCM-G system for monitoring of legacy parallel, tightly-coupled applications; adaptation of the OCM-G to the Grid workflow monitoring framework.

Challenge: Information model for recording workflow executions.
Solution: Design of Experiment Information, an ontology-based information model to represent the execution of workflow-based in silico experiments, including the domain semantics of computations and data used in the workflow.

Table 3.1: Key Grid workflow monitoring challenges and their solutions.

4. Development of an information model for recording workflow executions.

Table 3.1 briefly describes the solutions to the key challenges, which are addressed in the following chapters:

• Monitoring infrastructure for Grid scientific workflows – Chapter 5.

• On-line monitoring of Grid workflows – Chapters 6 (solution) and 11 (performance evaluation thereof).

• Monitoring of workflow legacy backends – Chapter 7.
• Information model for recording workflow executions – Chapter 8.

The main challenges are briefly described below.

Monitoring infrastructure for Grid scientific workflows

A monitoring infrastructure has to deliver a number of specific functionalities, including the specification and representation of monitoring events, sensor management, subscription management, and instrumentation. Given the heterogeneity of Grid scientific workflows, the monitoring infrastructure must feature

an open design in order to accommodate different monitoring data sources using diverse monitoring paradigms, interfaces, and data representations. In other words, a monitoring infrastructure for Grid scientific workflows should be designed as a framework to which diverse sources of monitoring data, as well as different instrumentation tools and technologies, can be adapted.

On-line monitoring of Grid workflows

On-line monitoring of a workflow is essential in several scenarios, the more so in a Grid environment, which is characterized by time-varying resource availability. Grid workflows are distributed, dynamic, and have a complex execution cycle. They are usually specified in an abstract way, without binding any particular resources to workflow tasks. This binding is done dynamically, just before the workflow’s execution, or even during its execution. The tasks are mapped onto resources by scheduling services, based on the current status of those resources. In many cases, the scheduling services used for Grid workflows have a decentralized architecture, which complicates the process of mapping tasks to physical resources [190]. This complexity makes resource discovery a particular challenge for workflow monitoring, since new workflow activities are typically created dynamically on dynamically-assigned resources. In order to support on-line monitoring of Grid workflows, automatic resource discovery is required, wherein consumers are automatically notified about new producers. Since there might be no single service providing up-to-date information about all workflow tasks, their discovery must proceed in a bottom-up manner: first, new workflow tasks should be discovered locally, then advertised globally, and finally discovered globally in order to take some action. However, traditional Grid information systems used for resource discovery are not suitable for fast notification-based discovery scenarios.
They are usually oriented towards high query performance and high query scalability, achieved by caching of information. Consequently, a distributed hash table (DHT) infrastructure, federated with the monitoring infrastructure, is proposed to store the shared state of the distributed monitoring services, and in this way to enable efficient automatic resource discovery. An alternative centralized approach, based on a central coordinator, is also conceived and built in order to compare it with the DHT-based one. The two proposed solutions are evaluated using a model-based performance analysis.

Monitoring of workflow legacy backends

Although the monitoring of legacy applications within scientific workflows can, to some extent, be described in terms of other challenges (e.g. heterogeneity), the presence of computationally-intensive legacy applications in the backends of workflow activities – distinctive to scientific workflows – deserves particular attention. Legacy applications usually constitute the computational core of the workflow. Yet, they are developed in programming languages and technologies whose support in modern execution platforms is limited. Though themselves parallel, tightly-coupled, and homogeneous, legacy applications introduce further heterogeneity into the workflow execution. The monitoring infrastructure must deal with legacy codes in a uniform way. Thanks to its design as a framework, the monitoring infrastructure is prepared to accommodate monitoring tools and technologies for legacy applications. However, the complexity of the workflow legacy backends – MPI parallel applications being a typical example – makes their monitoring challenging in itself. Furthermore, the contexts of the Grid environment, the workflow, and the broader monitoring framework introduce new issues into the monitoring of legacy applications. Therefore, this problem is treated separately in this Dissertation.
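To make the idea of accommodating legacy monitors in a common framework concrete, the following sketch (not taken from the Dissertation; all class and field names are invented) shows an adapter that translates a legacy monitor’s native trace records into a common event schema, in the spirit of how an OCM-G-like tool could be plugged into the monitoring framework:

```python
from abc import ABC, abstractmethod

class MonitorAdapter(ABC):
    """Common interface the framework could expect from any data source."""
    @abstractmethod
    def poll_events(self):
        """Return pending monitoring events in the framework's common schema."""

class LegacyMPIMonitorAdapter(MonitorAdapter):
    """Translates a legacy monitor's native trace records into common events."""
    def __init__(self, legacy_monitor):
        self.legacy = legacy_monitor

    def poll_events(self):
        events = []
        for rec in self.legacy.read_trace():   # native, tool-specific records
            events.append({
                "type": "mpi." + rec["op"],    # mapped to a common event type
                "process": rec["rank"],
                "timestamp": rec["ts"],
            })
        return events

class FakeLegacyMonitor:
    """Stand-in for a real legacy monitor (illustration only)."""
    def read_trace(self):
        return [{"op": "send", "rank": 0, "ts": 1.0},
                {"op": "recv", "rank": 1, "ts": 1.2}]

adapter = LegacyMPIMonitorAdapter(FakeLegacyMonitor())
events = adapter.poll_events()
print(events[0]["type"])  # -> mpi.send
```

The point of the sketch is only the shape of the solution: the framework sees one interface, while the complexity of the legacy tool stays behind the adapter.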
An existing OMIS approach to monitoring parallel applications is extended to support a Grid environment. An OMIS-Compliant Monitoring system – the OCM-G – is developed on the basis of an existing system for clusters – the OCM. Finally, the OCM-G is integrated with the monitoring infrastructure for Grid workflows as a proof of concept.

Recording workflow executions

Historic records of past application executions are used for various types of retrospective analysis, such as performance prediction based on an application’s past behaviour [183]. Taking benefit of previous runs typically involves the extraction of actionable information (knowledge) from multiple records of previous application executions. To enable this, the execution records need to be structured according to a common information model. However, in the case of scientific workflows, the records of previous executions gain a particular importance. For any given result of an in silico experiment performed by executing a workflow, it is essential to record the provenance of this result, i.e., the process which led to that result. Researchers, when using previous scientific results for subsequent experiments or for making decisions, first examine the provenance of those results in order to assess whether the result has been properly obtained and whether it is trustworthy. Furthermore, mining over many records of scientific experiments can provide a researcher with valuable insights and can be an important source of knowledge useful in his research. In other words, the records of scientific workflow executions are end-user oriented in the sense that end-users are direct consumers of those records. Therefore, the information model used to store workflow execution records must also be end-user oriented, for example in that it should reflect the domain semantics of the computations and data used in the workflow. This model must also support the discovery of provenance of scientific results.
The idea of including information about the semantics of workflow operations and data, using well-defined ontologies, is also in line with the emphasis on the role of the Semantic Web in e-Science, and with the vision of a Semantic Grid as a future infrastructure for e-Science [55].
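As an illustration only (the actual Experiment Information model is ontology-based; the stage and data names below are invented), even a minimal record of experiment stages with their inputs and outputs suffices to trace the provenance of a result backwards:

```python
# A toy experiment record: each stage notes which data it consumed/produced.
experiment = {
    "stages": [
        {"name": "align_sequences", "inputs": ["raw_reads"],  "outputs": ["alignment"]},
        {"name": "build_tree",      "inputs": ["alignment"],  "outputs": ["phylo_tree"]},
    ]
}

def provenance(record, result):
    """Walk the stages backwards, collecting every stage that the
    given result transitively depends on."""
    trail, frontier = [], {result}
    for stage in reversed(record["stages"]):
        if frontier & set(stage["outputs"]):     # this stage produced needed data
            trail.append(stage["name"])
            frontier |= set(stage["inputs"])     # its inputs must be explained too
    return trail

print(provenance(experiment, "phylo_tree"))  # -> ['build_tree', 'align_sequences']
```

In an ontology-based model the same dependency chain would be expressed with properties such as hasInput/hasOutput (cf. Figure 3.1), which additionally carry the domain semantics of the data items.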


Chapter 4

Related Work

4.1 Introduction

While Grid monitoring services are considered one of the most important elements of a Grid infrastructure [94], most Grid monitoring systems concentrate on the monitoring of Grid resources; examples include Nagios [99], JIMS [29], MDS [153], GridRM [165] and GridICE [4]. Few systems address the monitoring of Grid scientific workflows, and even fewer attempt to develop an information model for application execution monitoring. The following overview of related work is divided into two parts: the first describes Grid application monitoring approaches; the second, data models for recording application executions.

4.2 Monitoring of Grid Applications

GRM & Mercury

The Mercury Monitor and the GRM tool have been used to monitor parallel message-passing applications in the Grid [140]. GRM is a low-level tool capable of collecting events from application processes and disseminating them to consumers. The architecture of GRM includes Local Monitors, which collect events from local processes, and a Main Monitor, which collects the event traces and disseminates them to consumers. In order to adapt the monitoring functionality of GRM to the Grid, the Mercury Monitor has been introduced, which features Local Monitors, a Main Monitor, and a Monitoring Service enabling access to the monitoring functionality from remote sites. However, as shown in [140], the adaptation required substantial reengineering and reimplementation effort. The Local Monitors of GRM could not be reused; instead, the instrumentation library had to be rewritten in order to use the Mercury API and support the event trace format of Mercury. This is a consequence of the tight coupling of metric definitions with the monitoring API, instead of focusing on information exchange through published schemas of monitoring events. Application monitoring events are published as a metric called application.message. Individual appli-
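The contrast drawn above – metric definitions tightly coupled to a monitoring API versus information exchange through published event schemas – can be sketched as follows (an illustrative example, not code from any of the cited systems; the schema and field names are invented):

```python
# A published schema for one event type: consumers code against these
# field names and types, not against any particular monitor's API.
APP_MESSAGE_SCHEMA = {
    "event":   str,    # e.g. "application.message"
    "source":  str,    # id of the producing process or activity
    "ts":      float,  # event timestamp
    "payload": dict,   # metric-specific content
}

def conforms(event, schema=APP_MESSAGE_SCHEMA):
    """Check that an event carries exactly the published fields with the
    declared types; any producer emitting such events can be plugged in."""
    return (set(event) == set(schema)
            and all(isinstance(event[k], t) for k, t in schema.items()))

evt = {"event": "application.message", "source": "worker-0",
       "ts": 12.5, "payload": {"bytes": 1024}}
print(conforms(evt))  # -> True
```

With such a schema, replacing the transport or the monitoring tool does not force a rewrite of the instrumentation library, which is precisely the cost the GRM-to-Mercury adaptation incurred.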
