
University of Science and Technology AGH in Kraków
Faculty of Computer Science, Electronics and Telecommunications
Department of Computer Science

Doctoral Thesis

Unified Metadata Management in Large Distributed Computing Infrastructures

Author: mgr inż. Bartosz Kryza
Supervisor: Prof. dr hab. inż. Jacek Kitowski
Supporting supervisor: dr inż. Renata Słota

Kraków, Poland, June 25, 2014


Akademia Górniczo-Hutnicza im. Stanisława Staszica w Krakowie
Wydział Informatyki, Elektroniki i Telekomunikacji
Katedra Informatyki

Doctoral Thesis (Rozprawa doktorska)

Zarządzanie Zunifikowanymi Metadanymi w Rozproszonych Infrastrukturach Komputerowych Dużej Skali
(Unified Metadata Management in Large Distributed Computing Infrastructures)

Author: mgr inż. Bartosz Kryza
Supervisor: Prof. dr hab. inż. Jacek Kitowski
Supporting supervisor: dr inż. Renata Słota

Kraków, June 25, 2014


... The mysterious processing power didn't come from the computer itself, but from the entire network. As you know, it currently consists of over 40,000 computing centres, and as you probably don't know (at least I didn't until Hart enlightened me), it has a hierarchical structure, similar to that of a mammal's nervous system. The network has state nodes, and the memory of each of them stores more facts than all living scientists put together. Every customer pays a fee proportional to the computing power consumed each month, with proper factors, since if a problem that the user wants to solve is too hard for his nearest computer, the distributor automatically gives him additional resources from federal reserves, that is, computers with low loads. This distributor, naturally, is also a computer. It takes care of balancing the overall load of the network and keeps secure the so-called restricted memory banks, that is, classified data such as government, military and so on...

... The computer network resembles the electrical grid, but instead of current it provides information. The flow, both in the electrical and computer networks, is similar to that of water in connected pipes. The current flows in the direction of minimal resistance...

Stanisław Lem, 137 seconds, Wydawnictwa Literackie, Kraków, Poland, 1976 (transl. Bartosz Kryza)


Abstract

The first decade of the 21st century has brought an unsurpassed dependence of almost every aspect of human existence on the availability and performance of computer information systems. Unfortunately, the amount and complexity of resources available online make it more and more difficult to discover and use the ones best matching specific criteria. One of the main reasons for this situation is that most data and resources available online do not have sufficient annotations, i.e. metadata. The main goal of this thesis is the development of a new approach to metadata management in large distributed computing infrastructures using Semantic Web technologies. One of the assumptions of this thesis is that metadata should be treated as information not only about data, but also about any resources, such as services or processes. In order to provide means for the effective usage of such metadata, it is necessary to apply an appropriate representation for metadata and a solution for managing it in a scalable and secure way, enabling controlled evolution of this knowledge. This is a particularly important aspect of large computing infrastructures comprised of heterogeneous subsystems handling various aspects of the entire infrastructure.

Streszczenie

Pierwsza dekada XXI wieku przyniosła znaczące uzależnienie praktycznie każdego aspektu ludzkiego życia od dostępności i efektywności systemów informatycznych. Niestety ilość oraz stopień różnorodności zasobów dostępnych w sieci powoduje coraz większe problemy z dostępem do informacji i zasobów odpowiadających konkretnemu żądaniu. Jedną z głównych przyczyn takiego stanu rzeczy jest fakt, że większość danych i zasobów dostępnych w sieci nie posiada odpowiednich adnotacji, czyli metadanych. Celem tej pracy jest opracowanie nowego podejścia do zarządzania metadanymi w rozproszonych infrastrukturach komputerowych dużej skali z wykorzystaniem technologii Sieci Semantycznej. Jednym z założeń tej pracy jest znaczenie szeroko rozumianych metadanych w takich infrastrukturach, mianowicie metadanych rozumianych jako dodatkowe informacje nie tylko o danych, ale również zasobach, usługach czy procesach. Aby zapewnić efektywne zastosowanie tak rozumianych metadanych konieczne są odpowiednie formalizmy do ich reprezentowania oraz platforma zarządzająca nimi, czyli przechowująca metadane w sposób skalowalny, bezpieczny oraz umożliwiający kontrolowaną ewolucję tak gromadzonej wiedzy. Jest to szczególnie istotne w infrastrukturach informatycznych dużej skali składających się z heterogenicznych podsystemów obsługujących różne aspekty infrastruktury.


Contents

1 Introduction ... 1
  1.1 Context and Motivation ... 5
  1.2 Thesis ... 6
  1.3 Contribution and research highlights ... 7
  1.4 Dissertation structure ... 8
  1.5 Acknowledgments ... 9

2 Grid Computing, Services and Cloud infrastructures ... 11
  2.1 Evolution of Grid Computing ... 11
  2.2 Virtual Organizations ... 15
    2.2.1 Models of Virtual Organizations ... 16
    2.2.2 Contracts in Virtual Organizations ... 18
    2.2.3 Virtual Organizations in Grid Computing ... 19
  2.3 Existing Grid Environments ... 20
    2.3.1 Globus Toolkit ... 21
    2.3.2 UNICORE ... 22
    2.3.3 gLite ... 23
    2.3.4 QosCosGrid ... 24
  2.4 Most important standards and technologies in Grid computing ... 24
    2.4.1 OGSA ... 25
    2.4.2 WS-Resource Framework ... 26
    2.4.3 Security ... 27
    2.4.4 Information and monitoring services ... 28
  2.5 Interoperability between Grid frameworks ... 29
  2.6 Service Oriented Architectures ... 30
    2.6.1 SOAP, WSDL, UDDI and WS-I ... 32
    2.6.2 BPEL and WS-CDL ... 33
    2.6.3 REST ... 35
    2.6.4 Enterprise Service Bus ... 35
    2.6.5 P2P Computing ... 36
  2.7 Emergence of Cloud computing paradigm ... 37
  2.8 Cloud interoperability issues ... 39

3 Metadata and ontologies ... 43
  3.1 Metadata in general ... 43
  3.2 Metadata in computing systems ... 46
    3.2.1 Simple Knowledge Organization System (SKOS) ... 46
    3.2.2 GLUE Schema ... 47
    3.2.3 Common Information Model ... 47
  3.3 Knowledge representation formalisms ... 49
    3.3.1 Description Logics ... 49
    3.3.2 OWL – Web Ontology Language ... 50
    3.3.3 OWL 2 ... 51
  3.4 Semantic Web Knowledge Bases ... 52

4 Unification of Metadata in Distributed Computing Infrastructures Through Ontologies ... 55
  4.1 Motivation for metadata unification ... 55
  4.2 Description Logics as metadata representation formalism ... 56
  4.3 The Ontology Separation Scheme ... 57
  4.4 Generic Grid Ontologies ... 62
    4.4.1 Ontology of Resources ... 62
    4.4.2 Ontology of Data ... 62
    4.4.3 Ontology of Services ... 67
  4.5 Example use case ... 69
  4.6 Integration of Common Information Model metadata scheme to OWL ... 71

5 Managing Semantic Metadata with Grid Organizational Memory ... 77
  5.1 The requirements for a Grid Semantic Metadata Repository ... 77
  5.2 The Architecture of Grid Organizational Memory ... 78
  5.3 GOM Architecture ... 78
    5.3.1 GOM Engine ... 79
    5.3.2 Engine Manager ... 81
    5.3.3 Deployer ... 83
    5.3.4 GOMAdmin ... 83
    5.3.5 Proxy ... 84
    5.3.6 Protege plugin GOMTab ... 85
  5.4 Usage scenarios ... 85
  5.5 Performance evaluation ... 86
    5.5.1 GOM Event System and knowledge evolution ... 88
    5.5.2 GOM Peer-to-peer distribution model ... 91
  5.6 Interfacing Legacy Metadata Systems ... 98
    5.6.1 LDAP integration to semantic knowledge base ... 99

  5.7 Example interoperability analysis ... 101

6 Semantic Framework for Virtual Organizations ... 107
  6.1 Semantic approach to Virtual Organizations ... 107
  6.2 Framework architecture ... 108
  6.3 Support for contract negotiation ... 112
    6.3.1 Contract negotiation model ... 112
    6.3.2 Contract negotiation interface ... 119
  6.4 Security and Authorization Issues ... 123
  6.5 Example application of the FiVO framework ... 126

7 Summary ... 129
  7.1 Research contribution ... 129
  7.2 Thesis discussion ... 130
  7.3 Future work ... 132

Author's Publications ... 133

Bibliography ... 141


List of Figures

1.1 Thesis logical structure ... 6
2.1 Grid Layered Architecture [190] ... 13
2.2 Virtual Organizations in PL-Grid, Polish national Grid infrastructure ... 20
2.3 A high level view of Globus Toolkit components [187] ... 22
2.4 UNICORE architecture [205] ... 23
2.5 gLite architecture [197] ... 24
2.6 Overview of OGSA requirements and capabilities ... 25
2.7 SOA Logical Architecture Model [94] ... 31
3.1 UML representation of basic StorageService concept in GLUE Schema ... 47
4.1 Current metadata solution versus unified ontological approach ... 58
4.2 Separation scheme architecture ... 59
4.3 Example of encouraging inter-domain definitions of concepts with unified metadata model ... 61
4.4 Excerpt of the resource ontology ... 62
4.5 Excerpt of the computing resources ontology ... 63
4.6 DataObject - the core concept of the Data Ontology ... 64
4.7 The basic hierarchy of classes in storage aspect ... 66
4.8 The basic hierarchy of classes in format aspect ... 67
4.9 Core concepts in OWL-S service ontology ... 68
4.10 Generated ManagedElement hierarchy in original Managed Object Format (left) and in OWL (right) ... 76
5.1 Example of GOM infrastructure ... 79
5.2 Architecture of single GOM Engine component ... 82
5.3 Sample screenshot of GOM Admin webpage with a list of GOM Engines ... 84
5.4 Sample screenshot of GOMTab with list of changes to commit ... 85
5.5 Internal GOM Engine overhead including Jena processing time ... 87
5.6 Times of adding 100 DataObject instances for different configurations ... 88

5.7 Query times for different configurations ... 89
5.8 Reasoning times for different configurations ... 90
5.9 Change ontology hierarchy based on action type ... 92
5.10 Change ontology hierarchy based on action domain ... 93
5.11 Overview of the GOM P2P distribution model ... 95
5.12 Query time for a single node [47] ... 97
5.13 Response time as a function of number of nodes [47] ... 97
5.14 Overall design of X2R tool [32] ... 98
5.15 Application of X2R tool in a Grid environment ... 100
5.16 A job submission scenario from EGEE to VEGA ... 103
6.1 Overall FiVO architecture ... 110
6.2 Contract Ontology and its dependencies ... 111
6.3 The possible types of statements in the Contract Ontology ... 113
6.4 Contract negotiation process states ... 114
6.5 Knowledge Database perspective [31] ... 120
6.6 VBE Browser perspective [31] ... 121
6.7 Organization Definition perspective [31] ... 121
6.8 Negotiations perspective [31] ... 122
6.9 Globus container and FiVO authentication and authorization ... 124
6.10 Apache with FiVO authentication and authorization ... 124
6.11 Globus with FiVO authentication and PERMIS authorization ... 125

1 Introduction

The first decade of the 21st century has brought an unsurpassed dependence of almost every aspect of human existence on the availability and performance of computer information systems. Everything from space shuttles, airplanes, entire governments and corporations, through consumer products such as cars and cellphones, down to the smallest sensors produces data, communicates with other systems over some protocol, has some sort of unique identifier and reacts to events in its environment, constantly generating vast amounts of data and enabling human beings to live more dynamically, with almost instant access to information, video and audio content from any place in the world at any time of day. What is more, most day-to-day activities involve some sort of interaction with IT infrastructure, including weather forecasts, communication, traveling, trading or publishing. Each of these activities involves the invocation of some sort of Internet service, which runs on some computing hardware, provides some sort of interface and performs some action or produces some output.

Unfortunately, this massive increase in functionality and content variety available online causes people to have more and more problems in grasping its potential. It is already increasingly difficult to find the information most suitable for a given problem (page ranking by web search engines is based on keywords and is often tampered with). Furthermore, it is impossible to get aggregated or analytical answers to questions (all data must be processed individually). Finally, most data generated and stored is not appropriately (if at all) annotated, which renders it in fact useless in terms of automated discovery and processing (to a computer there is no difference in meaning between a text document and a video file). Even such a basic activity as browsing emails would not be possible without increasingly sophisticated mail clients and their advanced filtering and searching mechanisms.

According to an IDC report [69], 2007 was the first year in which the amount of data produced exceeded the world's storage capacity, which was then estimated at around 281 exabytes (i.e. 281 billion gigabytes).

An updated version of this report from 2011 [68] estimates this figure at 1.8 zettabytes. A more thorough evaluation of the computational capacity of existing infrastructure, presented in [71], showed that as of 2007 all computers in the world could execute 6.4 tera-MIPS (millions of instructions per second), i.e. on the order of 10^18 general instructions per second, had transmitted 2.1 zettabits of data (10^21 bits) and stored around 295 exabytes of compressed data. According to another report [63], American citizens alone consumed in 2008 around 3.6 exabytes of data from various sources, from newspapers to the Internet, with around 60% of this number constituted by TV broadcasting. These numbers show that even with the current increase in computational power and storage capacity, we will not be able to efficiently store and process all generated data and handle all user requests.

According to a website on Internet statistics [56], as of February 2012 there are over 2,000,000,000 Internet users (i.e. almost one third of the world population). In November 2013 there were already 31 operational petascale computing systems [76], with the fastest, Tianhe-2, reaching over 33 petaflops (as measured by the Linpack test); it remained in first place in the June 2014 Top 500 ranking as well. Scientists and industry are, however, already laying plans for exascale systems, i.e. systems with a thousandfold increase in computing power in comparison to the most powerful systems today [59, 75]. In fact, a detailed report on the future of large scale computing and the roadmap to exascale infrastructures by 2020 has been available since 2010 [65]. The overall statement of the report is that current software and hardware architectures for large scale systems are completely inadequate for the envisioned requirements of users and applications by the end of 2020, including support for massive parallelism (billions of threads per application) and handling the increasingly probabilistic operation of transistors at lower error margins (lower voltages and higher density). A similar report prepared for DARPA [61] reaches similar conclusions, including the prediction that by 2015 embedded systems will have terascale capacity, enabling completely different types of applications, e.g. on a laptop or cell phone.

The massive amounts of data produced have led many researchers and visionaries to announce a qualitative change in the way computer systems and scientific research will have to be conducted. In [57], the author states that with such amounts of data, creating models and testing them is obsolete, and only unbiased statistical analysis of data can give sensible results. Although this statement has been substantially criticized by the scientific community [73] as overly simplistic and perhaps relevant only to business scenarios rather than to science, it highlights an important point: the most significant practical results, whether scientific or commercial, are found through complex analysis of vast amounts of data. On the other hand, several notable scientists and visionaries agree that the most important scientific research results will be obtained through data intensive science, the so-called 4th paradigm [60].
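For clarity, the figure of 10^18 instructions per second quoted above is simply a unit conversion of the 6.4 tera-MIPS estimate, not an additional measurement:

    6.4 tera-MIPS = 6.4 × 10^12 MIPS = 6.4 × 10^12 × 10^6 instructions/s = 6.4 × 10^18 instructions per second,

i.e. about 6.4 exa-instructions per second across all of the world's computers as of 2007.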

It is clear that all these challenges can only be addressed if computers, services and IT infrastructure gain a substantially higher degree of autonomy:

• First of all, all elements involved in communication between services, such as identifiers, message formats, text and multimedia files, devices and their attributes, are based on syntactic datatypes, e.g. simple text strings or numbers, which are easily distinguished and analyzed by humans and the custom algorithms developed by them, but are meaningless to general-purpose computer programs,

• Secondly, the functionality of the available services must be exposed in a way that allows computer systems to automatically find new services and chain them together in order to optimally fulfill particular requests. Currently, hardcoded dependencies between services, often including cumbersome translation of intermediate communications between different formats, significantly discourage such functionality,

• Finally, the IT infrastructure itself must be adaptable and flexible, and must allow computer systems to decide where, when and how to execute algorithms, deploy services and store data.

The technological foundations for all these challenges are already maturing; however, they seem to be still far from reaching their full potential and supporting the majority of Internet activities today.

Since the end of the last century, it has become clear that huge scientific experiments, such as the Large Hadron Collider, will require much more computational capacity and storage space than can be provided by a single computing center. Grid systems are by definition composed of large numbers of geographically distributed and possibly replicated, heterogeneous resources not subject to centralized control [190]. Due to this fact, robust searching and selection among these resources becomes a critical issue, which can be addressed properly only when sufficient methods for describing them exist within the Grid. The analogy of the Grid computing model to the electrical grid has often been used to stress its main goal, i.e. instant availability of computational power on demand. Although Grid computing has brought several technological improvements, especially in distributed computing, it was still somewhat cumbersome to use: the effort of setting up data centers supporting the infrastructure and the mostly command-line access were limiting factors in bringing high performance computing power to the masses. This problem has been addressed by the wide adoption of virtualization technology which, although it has existed in computing systems since the IBM Research M44 machine [72], has only in the last decade been widely adopted as the core technology enabling flexible and cost effective management of large data centers [58], giving rise to the Cloud computing model.

The virtualization of the hardware aspects of IT infrastructures has been paralleled by increasing virtualization and abstraction of software and communication protocols. At the end of the last century, software was still mostly developed in low level programming languages. This meant tight coupling between components, which were usually statically bound during the linking phase, strong typing limiting the use of software components to only the declared data types, and tight coupling between components and applications through carefully defined interfaces. Communication, due to the network and computational capabilities of the systems, was mostly performed in low level binary formats such as those of CORBA or DCOM [62]. With the gradual adoption of higher level programming languages such as Java or C# [1], operating not directly on the hardware but through a software virtualization layer, component coupling moved to runtime, and communication protocols based on more human readable formats, such as SOAP [89], started to emerge. The recent widespread adoption of weakly typed dynamic languages such as Python [66] or Ruby [67], and their interconnection through protocols often not requiring any a priori interface definition, such as REST and JSON [74], shows that the trend toward full abstraction of the programming and communication models will place far greater demands on middleware components, which will have to decide which software components can and should be connected for a given request, and that much more information must be provided at the level of component metadata rather than in static interface definitions.

Although the communication protocols between software subsystems are becoming more and more abstract and high-level, computer systems still offer little help in terms of automated discovery, translation or chaining of components, because their descriptions are purely syntactic, without any assigned semantic meaning. This problem was recognized already at the beginning of the century [151, 152, 154], and several research groups from the fields of knowledge management and computational intelligence have collaborated under the umbrella of the Semantic Web to address it. The main driving idea was to develop knowledge modeling languages (e.g. RDF and OWL), knowledge management systems and semantic communication protocols which make it possible to assign meaning to the rapidly increasing amounts of data available online, such as web pages and web services, thus allowing computers to gain intelligence through tools able to process the semantic metadata assigned to syntactic objects such as data types or text.

Since metadata of any kind are basically sets of facts about some domain, it is possible to find a unified way of describing them. One of the solutions to this problem lies in the use of ontologies. Ontology-enhanced software systems have gained at least as much attention in recent years as Grid and Cloud computing, and much research has been done in this field that can be reused [151].
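To make the idea of assigning meaning to syntactic objects more concrete, consider a minimal sketch using the Python rdflib library (the ex: namespace, class names and resource names below are hypothetical, chosen only for illustration). Without the triples, the two resources are indistinguishable byte streams; with them, a program can select resources by meaning rather than by file extension:

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS
from rdflib.namespace import DCTERMS

# Hypothetical namespace for a computing infrastructure's metadata.
EX = Namespace("http://example.org/infrastructure#")

g = Graph()
g.bind("ex", EX)
g.bind("dcterms", DCTERMS)

# Assign machine-interpretable types to two otherwise opaque resources.
g.add((EX.report, RDF.type, EX.TextDocument))
g.add((EX.report, DCTERMS.format, Literal("text/plain")))
g.add((EX.lecture, RDF.type, EX.VideoFile))
g.add((EX.lecture, DCTERMS.format, Literal("video/mp4")))

# A tiny ontology: both classes are kinds of DataObject, so generic
# tools can treat them uniformly while queries can still discriminate.
g.add((EX.TextDocument, RDFS.subClassOf, EX.DataObject))
g.add((EX.VideoFile, RDFS.subClassOf, EX.DataObject))

# Select resources by meaning, not by name or extension.
for row in g.query("SELECT ?r WHERE { ?r a ex:TextDocument }",
                   initNs={"ex": EX}):
    print(row.r)  # http://example.org/infrastructure#report
```

The same pattern extends from files to services, hardware and processes, which is the broad sense in which the term metadata is used throughout this thesis.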

The Semantic Web has often been criticized for its complexity, lack of widespread adoption and the substantial burden involved in developing services and providing content based on semantic technologies [157, 158, 161]. However, it still provides several functionalities and added value which, combined with the virtualization of hardware through Cloud computing and the virtualization of software components through web service technologies, have the potential of bringing truly autonomous and dynamic computer systems, able to react to events, discover necessary facts and adapt to changes in the environment as well as to evolving user requirements. Even if the particular technologies developed and promoted by the Semantic Web projects are not optimal and never gain substantial adoption, it is clear that semantic information must, one way or another, be introduced into IT infrastructures. All these technologies combined have the potential to significantly improve the evolution of large scale computer systems by allowing the development of adaptable, flexible and intelligent software systems [70].

Along with the virtualization and abstraction of hardware, software and information itself, a natural consequence is the virtualization of the concept of an organization or any enterprise, be it a scientific research project (as in Grid computing [190]) or a complex industrial production chain [221]. This leads us to the notion of the Virtual Organization, another major concept identified at the end of the last century [226, 227, 234], and one of the most important still not fully realized ideas from the early stages of Grid computing [190].

This thesis proposes a novel approach to the issue of metadata management in large computing infrastructures through the unification of metadata representations using Semantic Web technologies. A significant notion of this thesis is the importance of metadata in such environments, that is, metadata considered as any information and knowledge not only about data assets but also about any resources referenced at the level of the computing infrastructure. In order to use such metadata effectively, sufficient functionality must be provided by the middleware layer in terms of metadata management, i.e. the provision of secure, expressive means for the storage, searching and evolution of metadata. This is a particularly important issue in large scale distributed infrastructures, where either heterogeneous technologies are used for metadata representation or the conceptualizations of different parties are incompatible.

1.1 Context and Motivation

As discussed in the previous section, research efforts in recent years have provided the potential for a qualitative improvement in the way large scale distributed IT infrastructures can be organized, managed, configured and monitored. Grid and Cloud computing, along with commonplace and efficient virtualization technologies, allow for abstraction of the hardware infrastructure in terms of computing power, storage as well as networking.

Service Oriented Architectures bring the possibility of virtualizing software components, allowing a high degree of separation of concerns and flexibility in interconnecting, invoking and matchmaking the different functionalities available. The Semantic Web efforts brought, to some extent, the possibility of a high degree of abstraction of metadata and all kinds of resource descriptions, and of formalizing them in order to allow computer programs to process and reason over such information.

The main focus of this work is to prove that a semantic approach can significantly improve existing large scale distributed computing infrastructures through unification of metadata representation and management. This includes the application of Semantic Web technologies to the unification of metadata in heterogeneous large scale systems, support for knowledge management through the design and development of a distributed semantic knowledge base, and the creation of a semantic Virtual Organization management framework. Figure 1.1 presents the logical structure of this thesis' contributions, where consecutive layers build on top of the results of preceding chapters.

Figure 1.1: Thesis logical structure (layers, bottom to top: Unification of Metadata in Distributed Computing Infrastructures; Managing Semantic Metadata with Grid Organizational Memory; Semantic Framework for Virtual Organizations)

All this work has provided several results and ideas on making applications in large computing infrastructures more adaptive, elastic and autonomous.

1.2 Thesis

The core thesis of this dissertation is as follows:

Unification of metadata in large distributed computing infrastructures using Semantic Web technologies can significantly improve resource discovery and management aspects of Grid and Cloud environments, and in particular enable dynamic establishment of Virtual Organizations over heterogeneous IT platforms.

The thesis is proved in the following steps:

1. by proposing a novel approach to metadata unification through Semantic Web technologies, which enables improved resource management in large scale computing infrastructures, including data and service discovery and matchmaking, and by implementing it in the form of a set of ontologies,

2. by showing how such semantic metadata can be managed in a distributed manner: the design and implementation of an ontological knowledge base are presented, supporting such features as a distribution model, knowledge evolution and semi-automatic import of legacy metadata, along with a performance evaluation,

3. by showing how the proposed unified semantic metadata approach and distributed knowledge base can be used to enable the establishment of contract based Virtual Organizations spanning heterogeneous IT infrastructures.

1.3 Contribution and research highlights

The thesis of this dissertation requires a comprehensive research approach covering all aspects of the mentioned issues, including among others the definition and development of semantic metadata in the form of ontologies, the development of a knowledge base system for management of the metadata in distributed settings, the creation of a Virtual Organization management framework, and several other auxiliary research efforts which are presented in more detail in this thesis. In summary, the main contribution highlights of this research include:

• Specification, development and evaluation of a modular metadata unification scheme, supporting the development and management of information and resources in large scale IT infrastructures such as Grid environments, covering such issues as data management, service discovery and matchmaking, and workflow composition [2, 3, 9]

• Design and development of a distributed semantic knowledge base supporting flexible configuration and integration through web service interfaces, reasoning, and knowledge evolution through a custom protocol for knowledge modification, including a design of a P2P distributed architecture [6, 10–12, 15, 17, 18]

• Design and development of a methodology for converting data modeling standards based on the Managed Object Format (MOF), such as the popular Common Information Model (CIM) standard, to the Web Ontology Language (OWL) [13, 16]

• Design of a system for semi-automatic translation of legacy metadata stored in relational databases, LDAP directories or XML files into ontological form [32, 42]

• Design and development of a Virtual Organization management framework supporting such aspects of collaborations based on distributed IT infrastructures as partner selection, goal definition, contract negotiation, user management, data management, service annotation, resource discovery, and dynamic Virtual Organization inception and management based on certificates and role based authorization rules, including automatic generation of security policies from semantically declared contract statements, and the specification and development of a contract negotiation framework including formal contract specification and a negotiation process [19, 20, 22, 24, 25, 27–29, 31, 33, 35, 36]

The breadth of the research has been focused on the possibility of combining the emerging technologies in order to support the collaborative, semantically aware distributed IT infrastructures of the future.

1.4 Dissertation structure

The structure of the thesis is as follows. The first chapter is the introduction, which presents the overall context of the work, focusing on the 3 major research and technological efforts related to large scale computing systems, i.e. the Grid and Cloud computing paradigms, Web Service technologies and the Semantic Web. Chapters 2 and 3 can be considered as state of the art, discussing in detail existing results in the respective fields related to this thesis. The remaining chapters present the original contributions of the author, organized according to the diagram in Figure 1.1. Chapter 4 presents the author's results on the application of Semantic Web technologies to metadata unification in Grid and Cloud systems, based on research performed in several national and international projects. Chapter 5 presents a knowledge management system built by the author, supporting several features required by the semantic unification approach presented in chapter 4. Chapter 6 presents the application of Semantic Web technologies to the area of Virtual Organizations by means of a versatile framework for the deployment and management of such entities. Finally, chapter 7 summarizes the research efforts of the author in the area of Semantic Web technologies and critically discusses the results as well as their potential impact and the future prospects of Semantic Web technologies with respect to the most recent advances in the relevant areas of computer science. The Bibliography is, for convenience, divided into subsections related to particular topics (e.g. Semantic Web); additionally, the author's publications are collected in a separate section at the beginning of the bibliography. All bibliography entries contain a short comment by the author on the content of the article or book, and sometimes the motivation for why it was relevant to this thesis or to a particular subject. Hopefully this will be useful to future readers of the thesis as a short overview of the most notable state of the art publications on a given subject.

1.5 Acknowledgments

In this section I would like to express my gratitude to the numerous people with whom I had the chance to work during my PhD research. Most importantly, I would like to thank my supervisor, prof. Jacek Kitowski, for his support and mentoring during my research, as well as several colleagues at the Department of Computer Science at the University of Science and Technology and the Academic Computer Centre CYFRONET AGH, in particular dr Łukasz Dutka and dr Renata Słota, as well as the several students I had the opportunity to work with. Furthermore, I would like to thank several colleagues from other institutes with whom I had the pleasure of collaborating in the scope of various national and international research projects, including:

• EU-IST FP5 Pellucid - A Framework for Experience Management in e-Government

• EU-IST FP6 K-Wf Grid - Knowledge-based workflow system for Grid applications (511385)

• EU-IST FP6 Gredia - Grid enabled access to rich media content (34363)

• EU-IST FP7 gSLM - Service Delivery & Service Level Management in Grid Infrastructures (261547)

• EU-IST FP7 PRACE - Partnership for Advanced Computing in Europe (PRACE-1IP RI-261557, PRACE-2IP RI-283493, PRACE-3IP RI-312763)

• EU-IST FP7 PaaSage - Model-based Cloud Platform Upperware (317715)

• European Defence Agency EUSAS - European Urban Simulation for Asymmetric Scenarios (A0676RTGC)

• POIG PL-Grid - Polish Infrastructure for Supporting Computational Science in the European Research Space (POIG.02.03.00-00-007/08-00)

• POIG PLGrid Plus - Domain Oriented Services and Resources of PL-Grid Infrastructure for Supporting Polish Science in the European Research Space (POIG.02.03.00-00-096/10)

• POIG IT-SOA - New information technologies for electronic economy and information society based on service-oriented architecture (POIG.01.03.01-00-008/08)


2 Grid Computing, Services and Cloud infrastructures

The purpose of this chapter is to present the current advancements in the large scale distributed computing infrastructures that are in use today. The presentation starts from the early history of supercomputing, through the definition of the Grid computing paradigm, followed by the establishment of Service Oriented Architectures and their adoption in Grid computing, and finally the emergence of the Cloud computing concept. The emergence and gradual widespread adoption of these technologies over the last 10 years has made it viable and necessary to improve the role of metadata in enabling collaboration between parties (such as business organizations or research teams) through resource sharing.

2.1 Evolution of Grid Computing

The idea of resource sharing and harnessing the power of large numbers of computational and storage resources available through a large scale network has been present since the early 1960s, along with the development of packet switching networks and ARPANET [196]. The same author is also quoted in the original 1969 UCLA press release announcing the launch of ARPANET as stating [211]:

As of now, computer networks are still in their infancy, but as they grow up and become more sophisticated, we will probably see the spread of 'computer utilities', which, like present electric and telephone utilities, will service individual homes and offices across the country.

With the progress in network technology on the one hand, and the computational power of cluster and vector based systems on the other, the idea of a metacomputer was proposed in the early 1990s [207]. The metacomputer concept sketched the main problems and challenges which needed to be addressed before truly large scale scientific computations could be conducted. The development of the metacomputer was to consist of 3 stages:

• 1st stage - Development of software enabling the user to transparently perform typical tasks of data access, computation execution and visualization,

• 2nd stage - Enabling the spreading of a single application across the distributed computing infrastructure,

• 3rd stage - Deployment of a high throughput national network infrastructure enabling seamless access to the sum of all computational and storage resources, based on appropriate standards.

Based on the research in this area, by the end of the 1990s the concept of the Grid had been coined [195]. Soon the ideas presented in this book were elaborated into a full Grid architecture [190]. The most important novelty here was that, in parallel with the proposed idea, a system architecture and a first implementation of a framework were proposed. The main characteristic features of the proposed Grid architecture included:

• Interoperable communication protocols enabling dynamic resource sharing, the joining and leaving of parties, and operation over heterogeneous platforms, programming languages and administrative domains,

• Protocols defining how different components and applications running in a distributed system communicate and interact with each other, enabling dynamic changes in the resource-sharing policies,

• Virtualization of the middleware components in the form of services.

The Grid architecture has been defined using the hourglass model, similar to the TCP/IP stack, where each layer identifies requirements for the services and protocols necessary for fulfilling the Grid and Virtual Organization concept. The basic layers include (see Figure 2.1):

• Fabric - essentially the hardware and software infrastructure of the Grid, consisting of the computational and storage resources and distributed file systems which are shared between the parties,

Figure 2.1: Grid Layered Architecture [190]

• Connectivity - this layer includes the core communication and security protocols within the infrastructure; in particular, the security infrastructure should support single sign-on, delegation, integration with local security systems and user-based trust relationship management,

• Resource - this layer defines the protocols built on top of the Connectivity layer, enabling secure resource management in the distributed Grid system. These include mainly information protocols for accessing metadata about resources, and management protocols which provide means for accessing and controlling resources,

• Collective - the protocols in this layer enable coordinated resource sharing in a large infrastructure and by several users simultaneously. These include such services as directory services, scheduling, monitoring, data management, workload management and service discovery,

• Applications - this layer constitutes the actual user applications running in the Grid infrastructure.

Furthermore, in order to clarify the essential characteristics of the Grid, its authors created a 3 point checklist for the Grid definition [186], where they define the Grid as a system which:

• coordinates resources that are not subject to centralized control

• using standard, open, general-purpose protocols and interfaces

• to deliver nontrivial qualities of service

The resources in the Grid system appear to the user in a virtualized form, i.e. the user does not have to deal with all the CPUs, disk drives or network interfaces available in the infrastructure, but can program the application using certain abstractions provided by the Grid middleware. The main abstractions typically available in Grid middleware include (a minimal code sketch of their user-facing side follows at the end of this section):

• Computational resources - the sum of the nodes of the infrastructure which have been designated for executing user jobs. The actual nodes can be either physical or virtual machines; the important thing is that the user does not have to manually distribute parts of the application to different machines, as this functionality is automatically handled by the scheduler,

• Storage - the storage resources represent the overall disk space where users can store their data. Depending on the middleware, the overall sum of the storage space can be virtualized through specialized file systems, giving the user the illusion of a single storage space,

• Network - the network resources represent the means of communication between distributed parts of the infrastructure, including physical high bandwidth links or virtual networks,

• Hardware and software - in addition to the generic hardware and software resources, the Grid can provide its users with custom devices or software packages which could be too expensive for a given user to purchase, but to which limited access can be granted to a Virtual Organization,

• Visualization - Grid computing opened new possibilities for scientific (protein folding, high energy physics, etc.) and industrial visualization (car prototyping, engineering) by providing large distributed computing power dedicated to a particular simulation for a given period of time.

This implies that resources shared between the parties can be administered locally by their owning organizations, and should be interoperable through standard protocols which allow the integration of components developed in different technologies and which provide added value beyond the simple sum of their constituents. The authors of the Grid concept proposed the notion of the Virtual Organization as a form of coordinated resource sharing between these organizations. An important aspect of the initial vision of the Grid was to enable the creation of multi-institutional Virtual Organizations, which allow scientists all over the world to collaborate through the sharing of resources, data and experience.
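The following sketch restates the middleware abstractions listed above as code. It assumes a purely hypothetical Python client (GridSession, JobDescription, submit and stage_in are invented names, not the API of Globus, UNICORE or any other middleware): the user expresses requirements and logical file names, while node selection and replica resolution stay behind the facade.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class JobDescription:
    """What to run and what it needs -- never where to run it."""
    executable: str
    arguments: List[str] = field(default_factory=list)
    cpu_cores: int = 1                      # requested capacity, not node names
    input_files: List[str] = field(default_factory=list)  # logical file names

class GridSession:
    """Hypothetical middleware facade: the only surface the user sees."""

    def stage_in(self, logical_name: str) -> None:
        # Storage virtualization: data is named logically; replica
        # selection and transfer happen behind this call.
        print(f"resolving replica for {logical_name}")

    def submit(self, job: JobDescription) -> str:
        # The scheduler, not the user, matches the request against the
        # fabric layer and places the job; here we only simulate that.
        job_id = f"job-{abs(hash(job.executable)) % 10000}"
        print(f"scheduled {job.executable} on {job.cpu_cores} cores as {job_id}")
        return job_id

session = GridSession()
job = JobDescription(executable="protein_fold",
                     arguments=["--iterations", "1000"],
                     cpu_cores=64,
                     input_files=["lfn:/vo.bio/inputs/1abc.pdb"])
for name in job.input_files:
    session.stage_in(name)
session.submit(job)
```

The design point is that nothing in JobDescription names a machine or a disk; this is exactly the virtualization the middleware layers above are meant to provide.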

2.2 Virtual Organizations

As the complexity both of the problems with which computer users are dealing and of the underlying computing infrastructure is constantly increasing, the middleware on which end user applications are deployed and provisioned must support several features enabling collaboration in a distributed and transparent manner. This is true for several kinds of applications, from scientific research, through education and business, to entertainment. In fact, several such paradigms have emerged, sometimes even without being given a special name. These include such concepts as the Virtual Organization, Virtual Enterprise, Virtual Community and Collaborative Networked Organization. These concepts, which are applied to particular modes of cooperation based on services provided by some IT infrastructure, all have one thing in common: they enable coordinated collaboration of distributed and heterogeneous entities through a set of capabilities provided by the IT infrastructure.

This section discusses existing research in this domain, dealing both with scientific Virtual Organizations and with more business oriented Virtual Enterprises. In particular, the dynamic construction of Virtual Organizations (VOs) based on knowledge and contract negotiation is discussed. The use of knowledge helps overcome the difficulty of organizational heterogeneity by making use of previous experience and organizational hierarchy. The hierarchy is reflected by low-level VO aspects (reflected in the IT infrastructure configuration, such as authorization rights and software integration) and high-level VO aspects (including business processes, user management and other domain specific issues), usually with dynamic properties defined by contracts as a way of defining business collaboration, which can emerge at any time on demand.

Virtual Organizations can be static or dynamic. Static Virtual Organizations are established ahead of time, when the goal of the VO emerges, and the setup process usually requires long term preparations. Dynamic Virtual Organizations are characterized by rapid establishment in order to pursue a quickly emerging goal, such as in the case of disaster management or some business opportunity. The process of dynamic VO creation and management should be supported by a kind of framework for the definition and operation of VOs emerging from groups of organizations.

The creation of both dynamic and static Virtual Organizations requires several assumptions on the environment from which they emerge. Ideally, all members must be ready to form or join a VO. This implies that they have the proper software infrastructure and policies which allow sharing and using data from third parties which are in the same Virtual Organization. A set of such organizations, which are ready to join or create a Virtual Organization within some particular environment, is often referred to in the literature as a Virtual Breeding Environment (VBE) [235].

2.2.1 Models of Virtual Organizations

The concept of the Virtual Organization provides means for organizing dynamic collaborations between distributed entities. Although no single definition of the VO exists, some general characteristics common to all of these definitions have been identified. These include Dematerialization, Delocalization, Asynchronization, Integrative atomization, Temporalization, Non-institutionalization and Individualization [223]. A Virtual Organization (VO) comprises a set of individuals and/or institutions having direct access to computers, software, data and other resources for collaborative problem-solving or other purposes. The VO is a concept that supplies a context for the operation of the Grid, which can be used to associate users, their requests, and a set of resources. The sharing of resources in a VO is necessarily highly controlled, with resource providers and consumers defining clearly and carefully just what is shared, who is allowed to share, and the conditions under which sharing occurs.

Significant work has been done on VOs not only in the Grid community, but also in the software community; the latter has focused on algorithms for dynamic environments, so that the VO and its participants can achieve their objectives. Another issue is that of the constituents of a VO. These can include both tangible and intangible resources, and especially the latter are often not given enough attention. They include knowledge, business processes, human capital, customer relationships, etc. The biggest value a company can bring to such a collaboration is its knowledge. An important part of the paper [230] describes the need for embracing change, which is the essence of each dynamic organization: the VO must adapt to the changes in the source network. It is also important to remember that Virtual Organizations and Virtual Enterprises evolved in the late 1990s from the area of systems integration. It is important to distinguish the Virtual Organization from simple eCommerce scenarios, that is, B2C or even simple B2B buy-sell use cases; the VO should be understood as a longer lasting and more complex collaboration of enterprises [219]. In this respect, it is important to remember that VOs, in order to have a chance of being established, need a prior environment. This is referred to as a Virtual organization Breeding Environment [230].

The authors of [224] have reviewed several existing methodologies and projects. The modeling approaches are classified according to 2 orthogonal dimensions, one specifying the target user of the model (human or software) and the other specifying the goal of the model (understanding or enactment). They distinguish 4 major groups of models: management models (high level requirements and use cases), management oriented process models (which describe what abstract activities should be taken to achieve the business level objectives), system requirements models (e.g. CIMOSA or GERAM) and finally enacted models (describing actual execution, e.g. in the form of workflows of services). The authors criticize previous approaches to VO modeling in various projects for starting the effort by defining detailed requirements or even enactment models without first analyzing the source network and describing the management processes, thus reinventing the wheel over and over.

The common problem in modeling was not the tools and the need to learn them, but deciding how to describe the problems, what levels of detail to use, etc. The authors propose some guidelines for practical modeling and conclude with the need to make modeling frameworks and tools simpler, allowing distributed modeling and definition of a VO.

Paper [222] proposes a VO reference model extrapolated from over 100 research papers on the subject. The authors introduce 3 types of VO topologies that are most common in practice: supply-chain, star and peer-to-peer, and claim that all of the analyzed projects can be categorized into one of these models. The analysis done in [228] presents requirements and proposes an architecture for a Virtual organization Breeding Environment management system for dynamic Virtual Organizations. The deliverable details all the necessary steps in each phase of the VBE life-cycle, for instance stressing the importance of ontology adaptation for a particular VBE in the creation phase. The VMS (Virtual organization Management System) implementation is based on a Service Oriented Architecture and conforms to a 3-tier architecture comprising a VBE Portal, a Business Logic Layer and a Data Layer. The data layer is based on relational databases.

The EU-IST ECOLEAD project deliverable [229] concentrates on issues related to VO management. It divides the management of a VO into phases aligned with the usual VO lifecycle: Initiating, Planning, Executing, Controlling and Closing. The authors list several approaches to VO management, including multi-organizational project management, the encouragement approach, the self-organizing approach, time-dominated VOs and supply-chain management. The deliverable describes the vision of an ideal VO in a business context, that is, a VO based on general characteristics and requirements which VO management systems will have to handle if they are to be considered for real applications in business settings. The authors also describe the functions and services required for managing a VO, which in general means ensuring that a VO is proceeding towards its objectives; the definition of these objectives is, however, not a part of VO management and must be predefined. The functions needed are split according to the phase of the VO lifecycle in which they should be available (e.g. monitoring in VO operation, replacing partners in VO evolution). The architecture proposed in this deliverable is called RAVE (Real-Time Activity monitoring for Virtual Enterprise). It assumes real-time, event based monitoring of events in the VO. The VO is managed by a central VOM server to which clients can connect. An important part of the RAVE system is VO contract monitoring, that is, assuring that all elements of the VO (e.g. Web Services) operate in accordance with the statements of the contract.

According to the authors of [218], a VO Modeling Framework is a set of tools, guidelines and examples that should assist humans in VO creation.

The authors discuss several approaches to VO inception. In the beginning, most approaches were manual and involved direct coordination and interactions between human partners. Another way emerged with the evolution of multi-agent systems theory, which naturally supports goal-oriented distributed cooperation. With Web Services, a service market approach has been observed, which assumes a set of service providers and service consumers, and some way for them to find one another. Some have also proposed using results from optimization theory to create a VO from the available resources (potential partners in the system) so as to minimize some criterion. The authors mention that in order to set up a VO dynamically, some partners must be ready and prepared to join the VO. This source network is referred to as a VBE (Virtual organization Breeding Environment). When the elements of a potential VO are not organizations but single persons, these are referred to as PVCs (Professional Virtual Communities) and the equivalent of the VO is the VT (Virtual Team). The document presents in detail the steps and issues related to VO creation. The stages include (assuming the prior existence of a VBE): collaboration opportunity identification, draft VO planning, partner search and selection, VO composition and negotiation, detailed VO planning and VO set-up.

2.2.2 Contracts in Virtual Organizations

The authors of [232] describe requirements for automating contract management in a VO. They identify 3 kinds of contracts in a VO: the business contract, the ICT contract and the ASP (Application Service Provider) contract. In the context of legality, a VO could, in theory at least, be registered as a legal entity or not; if it is not, it is defined by bilateral agreements between the respective partners expressed in the form of contracts. The ICT contract involves the client and the participants of the VO. It often repeats, in a more technical form, the statements of the business contract, such as security or property rights; more importantly, this contract should explicitly state what information and meta-information belongs to whom and who can access what. Tools are mentioned which enable on-line contract negotiation (e.g. the Virtual Negotiation Room). The ASP contracts are a way for VOs to appear as single legal entities to their clients by outsourcing the provision of their functionality to a third party, with which the client negotiates. The ASP has a separate contract with each of the VO participants.

The discussion in [233] tries to formalize a definition of a contract based multi-agent Virtual Organization. The authors define 4 key properties of VOs: Autonomy, Heterogeneity, Dynamism and Structure. They use terminology from agent-based systems, e.g. they refer to the VO itself as an agent. The contract is defined as a set of commitments, goals and agents in some context. The paper introduces a formal definition of a hierarchical VO with a set of agents (which can be VOs themselves), policies, goals and commitments. The VO is then a set of bilateral contracts between agents, and can thus be defined more easily in a distributed setting; for example, for 3 partners and 2 contracts A → B and B → C, A and C do not even need to know about each other.
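The bilateral-contract formalization sketched above can be captured in a few lines of Python; the sketch below is a loose illustration of the idea from [233] (the class and attribute names are the present author's, not the paper's notation). Because membership is induced by pairwise contracts, A and C end up in one VO without any direct agreement:

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Set

@dataclass(frozen=True)
class Contract:
    """A bilateral agreement: commitments of one agent toward another."""
    provider: str
    consumer: str
    commitments: FrozenSet[str] = frozenset()

@dataclass
class VirtualOrganization:
    """A VO modeled simply as a set of bilateral contracts between agents."""
    contracts: Set[Contract] = field(default_factory=set)

    def members(self) -> Set[str]:
        # Membership is derived from the contracts, not declared separately.
        return {agent for c in self.contracts
                for agent in (c.provider, c.consumer)}

# Two contracts, A -> B and B -> C: A and C never contract directly.
vo = VirtualOrganization(contracts={
    Contract("A", "B", frozenset({"provide 100 CPU-hours per day"})),
    Contract("B", "C", frozenset({"host C's data under agreed policy"})),
})
print(sorted(vo.members()))  # ['A', 'B', 'C']
```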

Further, in [225] the authors present web-Pilarcos, a J2EE-based agent framework for managing contract-based Virtual Organizations. The contract itself is an object (a J2EE EntityBean) and can be in one of several states, such as In-negotiation or Terminated. The proposed solution is not based on ontologies, although metadata reasoning is mentioned briefly. The proposed architecture consists of many different components, which might make it hard to integrate with custom systems. The paper discusses the basic requirements for a VO contract, such as modeling of service behavior, communication services and some non-functional properties such as QoS. It also discusses the operation of VOs, the need for monitoring of security and SLAs to ensure proper QoS, and the evolution of VOs. The last issue is addressed through the concept of epochs, which divide the timeline of a VO into periods between changes in the source network model (e.g. between partner changes).

2.2.3 Virtual Organizations in Grid Computing

Modern scientific and business applications are becoming increasingly complex, and they require that the underlying Grid middleware is properly configured. These applications often include workflow-type problems, interactive collaboration systems, or systems with real-time constraints.

VOMS

VOMS (Virtual Organization Membership Service) [217] is a system for managing users within Virtual Organizations based on their membership and roles. VOMS was originally developed within the European Data Grid project, and currently it can be easily integrated with the Globus MyProxy certificate management system. This allows automatic extension of proxy certificates with VOMS attributes, which are placed in the extension section as specified by the X.509 standard. The extension contains the name of the VO on behalf of which the certificate was issued, as well as the user's roles in this VO. This enables a push-based style of authorization, where the user's credentials are passed to the service along with the request and can be used to verify the user's authorization rights in a particular context.

The main problem related to VOMS is its relative simplicity and its very limited capabilities for defining security policies and handling advanced membership scenarios, which are limited to groups and roles.
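As an illustration of the push model described above, the sketch below uses only the standard Java certificate API to test whether a proxy certificate carries the VOMS attribute-certificate extension. The OID constant is an assumption based on commonly documented VOMS deployments, and parsing the DER-encoded attribute certificate itself (for which VOMS provides dedicated client libraries) is deliberately omitted:

```java
import java.io.FileInputStream;
import java.security.cert.CertificateFactory;
import java.security.cert.X509Certificate;

public class VomsExtensionCheck {
    // OID under which VOMS is commonly reported to embed its Attribute
    // Certificates in the proxy; treat this value as an assumption.
    private static final String VOMS_ACS_OID = "1.3.6.1.4.1.8005.100.100.5";

    public static void main(String[] args) throws Exception {
        CertificateFactory cf = CertificateFactory.getInstance("X.509");
        try (FileInputStream in = new FileInputStream(args[0])) {
            X509Certificate proxy = (X509Certificate) cf.generateCertificate(in);
            byte[] ext = proxy.getExtensionValue(VOMS_ACS_OID);
            System.out.println(ext != null
                    ? "Proxy carries VOMS attributes (" + ext.length + " DER bytes)"
                    : "No VOMS attribute extension found");
        }
    }
}
```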

Figure 2.2: Virtual Organizations in PL-Grid, the Polish national Grid infrastructure

UVOS

While VOMS is typically used in Globus-based Grid infrastructures, UNICORE has a central component called the Unicore User Database (XUUDB), which manages basic information about users and their roles. This component is largely equivalent to VOMS, as it allows user IDs (in the form of X.509 Distinguished Names) to be mapped to roles and groups. One important feature which VOMS lacks in comparison is support for XACML (eXtensible Access Control Markup Language), which allows complex resource access authorization policies to be expressed. In order to extend UNICORE with full support for Virtual Organizations, a separate component called UVOS (Unicore Virtual Organization System) was developed [173].
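The value of XACML-style policies over plain role lists can be illustrated with the following much-simplified Java sketch (hypothetical names; a real deployment would use an actual XACML engine rather than hand-written rules), in which a decision depends on a combination of subject attributes, resource and action rather than on a single role:

```java
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified stand-in for an XACML policy decision point:
// a request carries subject attributes (e.g. VO, role), a resource and
// an action, and rules decide Permit/Deny from their combination.
record AccessRequest(Map<String, String> subject, String resource, String action) {}

interface Rule {
    boolean permits(AccessRequest r);
}

class PolicyDecisionPoint {
    private final Set<Rule> rules;
    PolicyDecisionPoint(Set<Rule> rules) { this.rules = rules; }

    // Permit-overrides combining: any matching rule grants access
    boolean decide(AccessRequest r) {
        return rules.stream().anyMatch(rule -> rule.permits(r));
    }
}

class PdpDemo {
    public static void main(String[] args) {
        // Only sysadmins of the "plgrid" VO may reconfigure the storage service
        Rule storageAdmin = r -> "plgrid".equals(r.subject().get("vo"))
                && "sysadmin".equals(r.subject().get("role"))
                && r.resource().equals("storage-service")
                && r.action().equals("reconfigure");
        PolicyDecisionPoint pdp = new PolicyDecisionPoint(Set.of(storageAdmin));

        AccessRequest req = new AccessRequest(
                Map.of("vo", "plgrid", "role", "sysadmin"),
                "storage-service", "reconfigure");
        System.out.println(pdp.decide(req) ? "Permit" : "Deny");
    }
}
```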

2.3 Existing Grid Environments

Since the inception of the Grid computing concepts, several tools and Grid frameworks have been developed and adopted by the scientific and commercial communities. The most notable include the Globus Toolkit and UNICORE, which are discussed in more detail in the following sections.

2.3.1 Globus Toolkit

The Globus Toolkit has evolved as a major Grid middleware framework since 1997 [188]. The first major release used by several Grid infrastructures and distributed computing projects was Globus 2. It was based on a set of low-level components addressing the major aspects of a Grid infrastructure, such as resource allocation, data management, security and information services [186]. The major components included in this version of the framework were:

• Globus Resource Allocation Manager (GRAM) - this component provided a single unified API for remote job submission, providing a convenient abstraction over local and often heterogeneous batch systems [182],

• Global Access To Secondary Storage (GASS) - an API and implementation of a wide-area data management system, allowing user jobs to open remote files while handling several policies for data movement and caching [175],

• Monitoring and Discovery Service (MDS) - a distributed information service collecting and providing up-to-date information about the resources available in different sites, such as the number of machines, their configuration, available disk space, etc. [185] (a query example is sketched below),

• Grid Security Infrastructure (GSI) - an architecture providing secure communication, authentication and authorization based on X.509 certificates, including single sign-on and delegation through temporary proxy certificates [189],

• GridFTP - a protocol based on standard FTP (File Transfer Protocol), which provides means for reliable, high-throughput transfer of large chunks of data in a distributed environment, with such features as integration with the GSI infrastructure, third-party transfer, partial file transfer and TCP-level transfer optimization [178].

Over the last decade the Globus Toolkit evolved from a low-level set of services to a more interoperable Web Service architecture based around the concept of the Web Service Resource Framework (WSRF) [187]. The latest versions as of this writing, Globus Toolkit 4 and above, include a variety of components covering the major use cases related to performing large-scale computational experiments using distributed high performance computing resources. The overall architecture of the Globus Toolkit is presented in Figure 2.3, which shows how the legacy (or pre-WS) components cooperate with the novel WS-based components in order to enhance the interoperability of the framework with third-party infrastructures and tools.

Figure 2.3: A high level view of Globus Toolkit components [187]
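Since MDS2 exposed its information through LDAP, resource information could be queried with any LDAP client. The following sketch uses only the standard Java JNDI API; the host name, port, base DN and object class are assumptions reflecting commonly documented MDS2 defaults, not values taken from this thesis:

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.directory.InitialDirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

public class MdsQuery {
    public static void main(String[] args) throws Exception {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        // Hypothetical index-service endpoint; 2135 was the customary MDS2 port
        env.put(Context.PROVIDER_URL, "ldap://giis.example.org:2135");

        InitialDirContext ctx = new InitialDirContext(env);
        SearchControls sc = new SearchControls();
        sc.setSearchScope(SearchControls.SUBTREE_SCOPE);

        // List all host entries registered below the assumed base DN
        NamingEnumeration<SearchResult> results =
                ctx.search("Mds-Vo-name=local, o=Grid", "(objectclass=MdsHost)", sc);
        while (results.hasMore()) {
            System.out.println(results.next().getNameInNamespace());
        }
        ctx.close();
    }
}
```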

2.3.2 UNICORE

UNICORE (Uniform Interface to Computing Resources) is a Grid middleware stack which started as a German research project in 1997 [205] and evolved into a fully functional Grid middleware [209]. The overall architecture of the framework is presented in Figure 2.4. The framework is divided into layers composed of client-level components, service-level components and system-level components. The first layer provides client tools, including the very popular and capable UNICORE Rich Client, an Eclipse-based graphical user interface enabling easy execution of the most typical Grid use cases such as job submission or file management. The service layer provides SOA-compatible components, including job management functionality (XNJS), user management (XUUDB), a workflow engine and a Virtual Organization management service (UVOS). Finally, the bottom layer provides the Target System Interface (TSI), which allows integration with the particular resource managers and batch systems on target resources (a conceptual sketch of this layering is given below). Similarly to Globus, UNICORE uses the latest SOA standards, including WSRF and WS-I. One notable component with respect to this thesis is the Virtual Organization management component (UVOS), a flexible tool for managing users and their credentials, grouped into various Virtual Organizations, during their usage of the system, with authorization handled through both SAML (Security Assertion Markup Language) and X.509 certificates [173].

Figure 2.4: UNICORE architecture [205]
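The role of the TSI can be illustrated with the following conceptual Java sketch (hypothetical interfaces, not the actual UNICORE API): the upper layers work with an abstract job description, and a per-site TSI implementation translates it into the command of the local batch system:

```java
// Hypothetical illustration of the TSI idea: one abstract job model,
// one translation per target batch system. Not the real UNICORE API.
record JobDescription(String executable, int nodes, String queue) {}

interface TargetSystemInterface {
    // Produce the site-local submission command for an abstract job
    String submitCommand(JobDescription job);
}

class TorqueTsi implements TargetSystemInterface {
    public String submitCommand(JobDescription j) {
        return "qsub -q " + j.queue() + " -l nodes=" + j.nodes() + " " + j.executable();
    }
}

class SlurmTsi implements TargetSystemInterface {
    public String submitCommand(JobDescription j) {
        return "sbatch -p " + j.queue() + " -N " + j.nodes() + " " + j.executable();
    }
}

class TsiDemo {
    public static void main(String[] args) {
        JobDescription job = new JobDescription("run_simulation.sh", 4, "batch");
        for (TargetSystemInterface tsi : new TargetSystemInterface[] {
                new TorqueTsi(), new SlurmTsi() }) {
            System.out.println(tsi.submitCommand(job)); // same job, two dialects
        }
    }
}
```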

2.3.3 gLite

gLite (Lightweight Middleware for Grid Computing) [197] is another example of a Grid framework, developed largely in Europe within the frame of the EGEE (Enabling Grids for E-sciencE) series of projects. The main goal of this framework is to provide users with a consistent interface and a set of services wrapping access to other Grid components, such as the Condor batch scheduler or Globus. The overall architecture of the system is presented in Figure 2.5, which shows the five main service groups provided by the gLite middleware: security services based on GSI and WS-Security, information and monitoring services, workload and job management services based on the JDL (Job Description Language), and data services based on file management components providing a mapping between logical file names (LFNs) and the actual distributed replicas of a file. Currently gLite is part of a larger framework called EMI (European Middleware Initiative) [167].

Figure 2.5: gLite architecture [197]
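For illustration, a minimal JDL job description of the kind consumed by the gLite workload management services might look as follows; the file names are arbitrary and the attribute set shown is a commonly used minimal subset, not a complete description:

```
Executable    = "/bin/hostname";
Arguments     = "-f";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
```

Such a description would typically be submitted to the workload management services, which match it against the available computing elements.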

2.3.4 QosCosGrid

QosCosGrid (QCG) [177] is a highly integrated Grid framework for developing Grid applications, including support for advanced resource reservation, scheduling and monitoring. It was developed within the framework of the PL-Grid and PLGrid Plus projects as an alternative to existing Grid middleware solutions, in order to provide the Polish research community with more feasible means of accessing the distributed Grid infrastructure. QCG supports command line access as well as an IDE (Integrated Development Environment) developed using the Eclipse platform.

2.4 Most important standards and technologies in Grid computing

Although several Grid computing frameworks exist, a number of standards and technologies have emerged over the last 10 years which provide functionality that can be used in an interoperable fashion between them. This section provides a brief overview of the key Grid technologies and standards relevant for this thesis.

2.4.1 OGSA

The Open Grid Services Architecture (OGSA) is a high-level specification providing a set of requirements which should be met by a proper Grid infrastructure, together with capabilities addressing these requirements. The requirements cover both functional and non-functional aspects of the possible use cases in a Grid environment, including scientific and business scenarios. They include aspects typical of Grid environments, such as heterogeneous resources, site autonomy, resource virtualization, standardization, global namespace and metadata services, monitoring, Service Level Agreements, job scheduling, and advanced security and data management functionalities, among others [186].

Figure 2.6: Overview of OGSA requirements and capabilities. The figure juxtaposes the requirements (interoperability and support for dynamic and heterogeneous environments; resource sharing across organizations, with global name space, metadata services, site autonomy and resource usage data; optimization; Quality of Service (QoS) assurance; job execution; data services; security; administrative cost reduction; scalability; availability; ease of use and extensibility) with the capabilities that address them (Infrastructure Services, Execution Management Services, Data Services, Resource Management Services, Security Services, Self-Management Services and Information Services).

Based on these requirements, the OGSA model defines a set of services, called capabilities, which should be provided by the Grid infrastructure in order to meet particular requirements (Figure 2.6). The services providing the proposed capabilities sit in the Grid framework above the base resources, such as hardware and operating systems, and below the end-user applications and domain-specific Grid applications. An important thing to note is that the services proposed by OGSA are largely orthogonal,

and every Grid infrastructure can provide only selected services, depending on its particular needs. Although in theory the architecture is transport agnostic, in practice the services in OGSA are envisioned to be compliant with Web Services standards such as WSDL and SOAP. Along with the OGSA standard, an infrastructure layer called OGSI (Open Grid Services Infrastructure) was proposed [210], which introduced a notion of stateful Grid resources into the otherwise stateless Web Service protocols. However, this standard was soon superseded by WSRF (Web Service Resource Framework), which was integrated into version 4 of the Globus Toolkit [194].

2.4.2 WS-Resource Framework

Due to several problems with, and the critical reception of, the OGSI standard, a new standard was proposed to deal with the stateless nature of Web Services. Several industrial organizations, most notably the Globus Alliance and IBM, proposed the WS-Resource Framework, whose goal was to address the main drawbacks of OGSI, such as notification being tightly coupled with the data, poor compatibility with existing XML tools, and the requirement for Web Services containers to properly construct and destroy services. The new proposal extends the functionality of regular Web Services with such aspects as creation, inspection, addressing, management and notification of stateful Web Service resources. The main specifications within this framework include [172]:

• WS-ResourceProperties - defines and describes resources in terms of Web Service properties,

• WS-ResourceLifetime - management of resource lifetime,

• WS-RenewableReferences - management of references to resources,

• WS-ServiceGroup - grouping of Web Services into collections,

• WS-BaseFaults - error management,

• WS-Notification - this specification provides means for subscribing clients to, and notifying them about, various events related to a resource, for instance a change of a property value.

WSRF is not dedicated only to Grid environments, and several non-Grid implementations exist [193]. The stateful-resource pattern underlying these specifications is sketched below.
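The following minimal Java sketch (hypothetical types, not any concrete WSRF container API) illustrates the implied resource pattern that WSRF standardizes: the service itself stays stateless, while state lives in resources addressed by a key that, in WSRF, would travel inside a WS-Addressing endpoint reference (EPR):

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the WSRF "implied resource pattern".
// A stateless service front-end dispatches each call to a stateful
// resource identified by a key carried with the message.
class CounterResource {
    private int value;                       // a WS-ResourceProperty stand-in
    synchronized int increment() { return ++value; }
}

class CounterService {
    private final Map<String, CounterResource> home = new ConcurrentHashMap<>();

    // WS-ResourceLifetime analogue: create a resource and hand back its key
    String createResource() {
        String key = UUID.randomUUID().toString();
        home.put(key, new CounterResource());
        return key;
    }

    // The operation itself is stateless; all state lives in the resource
    int increment(String resourceKey) {
        CounterResource r = home.get(resourceKey);
        if (r == null) throw new IllegalArgumentException("unknown resource");
        return r.increment();
    }

    // WS-ResourceLifetime analogue: explicit destruction
    void destroyResource(String resourceKey) { home.remove(resourceKey); }
}

class WsrfDemo {
    public static void main(String[] args) {
        CounterService service = new CounterService();
        String key = service.createResource();  // the key plays the role of an EPR
        service.increment(key);
        System.out.println(service.increment(key)); // prints 2
        service.destroyResource(key);
    }
}
```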
