
AGH University of Science and Technology
Faculty of Computer Science, Electronics and Telecommunications
Department of Computer Science

Doctoral Thesis

Transparent Data Access in Federated Computational Infrastructures

Author: mgr inż. Michał Wrzeszcz
Supervisor: Prof. dr hab. inż. Jacek Kitowski
Co-supervisor: dr hab. inż. Renata Słota

Kraków, Poland, April 2019

Akademia Górniczo-Hutnicza im. Stanisława Staszica w Krakowie
Wydział Informatyki, Elektroniki i Telekomunikacji
Katedra Informatyki

Rozprawa doktorska

Transparentny Dostęp do Danych w Sfederalizowanych Infrastrukturach Obliczeniowych

Autor: mgr inż. Michał Wrzeszcz
Promotor: Prof. dr hab. inż. Jacek Kitowski
Promotor pomocniczy: dr hab. inż. Renata Słota

Kraków, Kwiecień 2019

I would like to thank Professor Jacek Kitowski, my PhD supervisor, for his guidance, support and valuable advice that helped shape this thesis. My sincere thanks also go to Dr. hab. Renata Słota for her suggestions and fruitful discussions during the writing of this dissertation. I am also grateful to ACC Cyfronet AGH for providing the resources required to verify the thesis. Among my colleagues from ACC Cyfronet AGH and the Department of Computer Science, there is one person to whom I especially want to express my gratitude: the leader of the Onedata team, Dr. Łukasz Dutka, who supported me with his knowledge and passion. I also owe thanks to my colleagues from the Onedata team for a great time working together, and to Piotr Nowakowski for consulting the text of the thesis. Last but not least, I would like to thank my family: my wife and my daughter, for their patience and forbearance.

Abstract

Current scientific problems require strong support from data access and management tools, especially in terms of data processing performance and ease of access. However, when analysing the elements that influence user operations, it is impossible to choose a single set of mutually nonexclusive features that satisfies all the requirements of data access stakeholders. Thus, the author has decided to study how a large-scale data access system should operate in order to meet the needs of a multiorganizational community. The author has identified context, represented by metadata, as a key aspect of the solution. On this basis the author postulates that context awareness enables data to be provisioned to users in a transparent manner while maintaining quality of access. However, along with the growth of the environment in terms of round-trip times, metadata management becomes challenging due to access and/or management overheads, often resulting in bottlenecks. Thus, the author has identified and classified contextual metadata, taking into account consistency and synchronization models, utilizing BASE (Basically Available, Soft state, Eventual consistency) rather than ACID (Atomicity, Consistency, Isolation, Durability) whenever possible. This dissertation describes the steps undertaken in order to validate the author's thesis, starting with an analysis of the requirements of federated computational infrastructure stakeholders and the shortcomings of existing data access tools. The core element of the thesis is a description of the Model of Transparent Data Access with Context Awareness (MACAS), designed to accommodate dynamic changes of factors which affect data access in order to provide the desired access characteristics to specific groups. To solve this complex task, the model introduces layers and cross-cutting concerns which cover different aspects of data access, such as interactions with diverse storage resources, users' interactions with the data access system, coordination of the execution of multiple operations to utilize more than one storage system, efficient utilization of network resources, cooperation of resource providers and distribution of the environment. The author also presents an implementation of the proposed model that focuses on the ability to process large amounts of metadata, along with notifications which enable broad provisioning of up-to-date context information. The dissertation is concluded by a description of tests carried out in a federated environment, without any assumptions regarding the providers' mutual relationships. These tests validate the model's quality as well as its capability for adaptation to nonfederated environments.

Streszczenie

Aktualne problemy naukowe wymagają odpowiednich narzędzi zapewniających nie tylko wydajny dostęp do danych, ale i łatwe zarządzanie danymi. Analizując elementy, które mają wpływ na operacje wykonywane przez użytkownika, nie jest jednak możliwe wybranie jednego zestawu funkcjonalności, który satysfakcjonuje wszystkich zainteresowanych dostępem do danych. W związku z tym autor zaproponował i poddał badaniom system dostępu do danych spełniający wymagania społeczności użytkowników zrzeszonych w wielu niezależnych organizacjach. Autor zidentyfikował kontekst, reprezentowany jako metadane, jako kluczowy element rozwiązania, formułując tezę, że znajomość kontekstu umożliwia transparentne dostarczanie danych użytkownikom, utrzymując przy tym jakość dostępu. Jednak wraz z rozrastaniem się infrastruktury, zarządzanie metadanymi staje się coraz bardziej wymagające z powodu narzutów na synchronizację i/lub opóźnień w dostępie, które mogą doprowadzić do powstania wąskiego gardła w systemie dostępu do danych. W związku z tym autor zidentyfikował metadane opisujące kontekst i sklasyfikował je na podstawie modeli spójności i synchronizacji, w celu zapewnienia dostępności i efektywności kosztem transakcyjnego, atomicznego przetwarzania, tam gdzie jest to możliwe. Rozprawa zawiera opis etapów realizowanych w celu weryfikacji sformułowanej w rozprawie tezy, zaczynając od analizy wymagań oraz niedoskonałości istniejących narzędzi zapewniających dostęp do danych. Głównym osiągnięciem pracy jest model transparentnego dostępu do danych z wykorzystaniem kontekstu (ang. Model of Transparent Data Access with Context Awareness, MACAS), który umożliwia dynamiczne zmiany parametrów wpływających na dostęp do danych, aby zapewnić pożądaną charakterystykę dostępu do danych przez poszczególnych użytkowników. Aby rozwiązać tak złożone zadanie, model składa się z warstw obejmujących różne aspekty dostępu do danych, takie jak interakcja z różnymi systemami składowania danych, interakcja użytkowników z systemem dostępu do danych, koordynacja wykonania wielu operacji w celu wykorzystania więcej niż jednego systemu składowania danych, wydajne wykorzystanie zasobów sieciowych, współpraca organizacji dostarczających zasoby dyskowe i obliczeniowe oraz rozproszenie środowiska. Autor przedstawia także implementację proponowanego modelu, która koncentruje się na możliwości przetwarzania dużej ilości metadanych oraz powiadomień, które umożliwiają dostarczanie szerokich i aktualnych informacji kontekstowych. Na zakończenie prezentowane są testy w środowisku sfederowanym, które udowadniają jakość systemu utworzonego na bazie modelu, a także zdolność dostosowania modelu do środowisk niesfederowanych.

Contents

1 Introduction
  1.1 Motivation
  1.2 Thesis Statement and Research Objective
  1.3 Thesis Contribution
  1.4 Note on Participation in Research Projects
  1.5 Thesis Structure
  1.6 Definitions of Terms

2 Background Survey
  2.1 Computational Environments
    2.1.1 Grid Computing
    2.1.2 Cloud Computing
  2.2 Typical Grid and Cloud Data Access Tools
  2.3 Tools for Anytime/Anyplace Data Access
  2.4 Tools for Distributed Data Processing
  2.5 Tools for Unified View of Multiorganizational Data
  2.6 Summary

3 MACAS - Model of Transparent Data Access with Context Awareness
  3.1 Data Access Stakeholders
  3.2 Basis of MACAS
  3.3 Context Modelling in MACAS
    3.3.1 Types of Metadata
    3.3.2 Description of Metadata
    3.3.3 Classification of Metadata
    3.3.4 Metadata Consistency and Synchronization Models
  3.4 Model Description
    3.4.1 Description of MACAS Layers and Concerns
    3.4.2 MACAS Algorithm
  3.5 Summary

4 Architecture and Selected Aspects of Implementation
  4.1 Overall Architecture of the System
    4.1.1 Metadata Distribution
    4.1.2 Handling Metadata Updates
    4.1.3 Propagation Delay for Metadata Changes and its Consequences
  4.2 Architecture of Data Management Component
    4.2.1 DMC Core
    4.2.2 DMC Modules
    4.2.3 Request Handling and Load Balancing
  4.3 Summary

5 Experimental Evaluation
  5.1 DMC Core Tests
    5.1.1 Evaluation of Request Routing and Processing
    5.1.2 Metadata Access Evaluation
    5.1.3 Reliability Evaluation
  5.2 Performance Evaluation of Integrated System
    5.2.1 Overhead Evaluation
    5.2.2 Evaluation of Scalability and System Limits
    5.2.3 Evaluation of Overhead in a Multi-DMC Environment
  5.3 Datachunk Management Evaluation
  5.4 Context Awareness Evaluation
  5.5 Contribution of Context Awareness to Experiments
  5.6 Evaluation Summary

6 Conclusions and Future Work
  6.1 Summary
  6.2 Research Contribution
  6.3 Range of Applications
  6.4 Future Work

Bibliography
Author's Bibliometric Data
Author's Publications

List of Figures

1.1 View of collaboration at scale
1.2 Influence of metadata on data access
1.3 Proposed evolution of data access model
1.4 Contribution of the author (green) and collaborative tasks in which the author was involved (black)
2.1 Drawbacks of existing tools for federated computational environments (red - drawbacks, green - advantages offset by other drawbacks)
3.1 Scheme of the data access system
3.2 Stakeholders' relation to data access in federated computational environments
3.3 Basic metadata used during data access
3.4 Example of datachunk replication
3.5 Main metadata dependencies
3.6 Model of Transparent Data Access with Context Awareness
3.7 Algorithmic representation of MACAS
4.1 Overall architecture of the system
4.2 FUSE client (FClient) concept
4.3 Sample FClient pseudocode
4.4 Basic pseudocode for handling FClient requests
4.5 Pseudocode for FClient datachunk synchronization request handling
4.6 Pseudocodes for updates of metadata describing datachunks
4.7 Pseudocodes for metadata updates following merger and invalidation of datachunks
4.8 Metadata stores and caches
4.9 Sample flow between metadata stores and metadata caches
4.10 Conflict resolution pseudocode
4.11 Deployment of DMC Core
4.12 DMC modules
4.13 Request flow – different modes with different features
5.1 Normalized throughput with similar load on all DMC cluster nodes
5.2 Normalized throughput with DMC cluster nodes divided into two groups with different load
5.3 Fragments of logs from reliability tests
5.4 Total aggregated throughput and CPU usage
5.5 Data access throughput for local and shared datasets
5.6 Changing distribution of file fragments among DMCs
5.7 Test environment for comparing management policies
5.8 Context awareness test environment
5.9 Results of selected steps of the context awareness test
6.1 Relation of issues connected to transparent data access
6.2 Author's individual achievements (green), collaborative achievements (black) and tasks in which the author was not involved (orange)

List of Tables

1.1 Data storage, access and management levels
2.1 Features of data access solutions
2.2 Existing solution characteristics
3.1 Features of the data access system expected by stakeholders
3.2 Abbreviations for types of metadata used in equations
3.3 Classes of metadata with abbreviations used in equations
3.4 Metadata classes
4.1 Implementation assumptions for MACAS
4.2 Aggregation of events and changes
4.3 Implementation of MACAS layers and concerns by DMC modules
5.1 Request handling time and characteristics of request processing modes
5.2 Test configurations for metadata access
5.3 Average metadata access times for different configurations and computing environments
5.4 Number of memory slots occupied at the end of the test for different configurations and computing environments
5.5 Results of reliability tests
5.6 Description of overhead tests
5.7 Throughput of a system implementing the MACAS model in comparison with direct access
5.8 Total aggregated throughput and number of operations per second
5.9 Comparison of management policies
5.10 Types of context awareness

1 Introduction

With the fast growth of the digital universe, data access and processing at a global scale are at the centre of scientific and commercial interest. This follows from the ever increasing scale of research problems, which calls for wide-ranging collaboration between groups of researchers making use of geographically distributed, heterogeneous data sources (cf. Figure 1.1). These problems – such as Data Science [35; 57], Big Data [26; 53], the 4th Paradigm [18] and Science 2.0 [47], each represented by a set of activities performed worldwide – require strong support from data access and management tools, which must evolve to meet demands for data processing performance supported by the available storage resources, as well as for ease of use. Extracting knowledge or generating insight from data provides an understanding of contemporary scientific, commercial and social challenges.

Figure 1.1: View of collaboration at scale

A recognizable feature of modern data access and management systems is the tendency to cross organizational boundaries, resulting in work at the forefront of science and engineering, such as the processing of astronomy data [44] or the sharing and analysis of the results of large-scale experiments and simulations (e.g. the Human Brain Project [24] or the Worldwide LHC Computing Grid [12]). New data storage and management paradigms are foreseen, e.g. the concept of data lakes [49] for maintaining and sharing different types of data. Modern science introduces many problems which require collaborative work and call for substantial resources. Likewise, the business world acknowledges the increasing role of efficient data

management for analyzing constantly growing volumes of data in order to remain competitive on the market. The need for e-infrastructures which deal with open data, facilitating easy access and sharing in organizationally distributed environments, has already been acknowledged, giving rise to various projects and initiatives, e.g. [81; 72]. Nevertheless, the development of such systems clearly lags behind expectations where ease of use, effectiveness, transparency of data access and heterogeneity of resources are concerned.

Collaboration between distributed groups of workers calls for sophisticated systems, which, in turn, requires a set of specific problems to be solved separately to meet collaboration requirements. Due to the complexity of such systems, their development is usually performed by teams of architects and developers with well-defined roles and activities, inspired by specific use cases and end users.

In formal terms, data access and management can be analyzed at several levels, from personal data to globally distributed shared data. This leads to various cooperation problems which need to be overcome by data access and management tools. The simplest case involves access to local data, i.e., data stored on direct attached storage (DAS). The user can control all activities; the only problem is to provide device drivers and solutions that use the hardware in an optimal fashion, such as a striping algorithm for an SSD storage array [39], tuning the storage system for a particular type of usage (e.g., archiving [23]) or balancing between several levels of a hierarchical storage system. The next level involves provisioning data access to a group of users working for a single organization, e.g., over network attached storage (NAS). Problems encountered in this case pertain to possible network failures, higher latency [37], or simultaneous work by multiple users, impacting quality of service parameters [29; 30; 36]. Storage systems must be able to maintain the required QoS and cost effectiveness on the organizational level, assisted by request scheduling and resource allocation algorithms [43; 32].

The resources offered by a single organization may prove insufficient for users who require large amounts of storage and computing power to process data streams produced continuously by experiments, or who make use of large amounts of information gathered in existing datasets. For such users, resource providers create federated organizations (FO) – or federations for short – i.e., groups which agree to collaborate in order to simplify access to resources which belong to multiple organizations, often deploying a storage area network (SAN) and defining detailed common rules of cooperation and resource sharing [42]. Nevertheless, many problems related to federated data remain unresolved. Computing Grids [2] and Virtual Organizations (VOs) [1; 14] address issues related to decentralized management by organizations that use different policies and make autonomous decisions according to local requirements. Nevertheless, further work is required to improve the efficiency and convenience of data access, as well as to ensure cost-effective data management. The most complex case is the use of data provided by several nonfederated organizations (NFOs), i.e., organizations that do not have any cooperation agreement in place. In this case, challenges related to trust and lack of standards appear.
As there is no bond between NFOs, the exchange of information concerning users and their data is difficult, since each NFO may apply its own authentication mechanism. Table 1.1 summarizes the presented levels of data storage, access and management; at each level, a cumulative increase of difficulty is observed.

Table 1.1: Data storage, access and management levels

| Environment description | Level | Sample problems |
|---|---|---|
| Direct attached storage | Local | Provision of device drivers to maximize hardware performance |
| Network attached storage for a group of users | Organization | Provision of device drivers for a distributed environment to minimize the negative impact of simultaneous use |
| Resources offered by closely collaborating organizations | Federated Organization (FO) | Distributed management and low-level data access policies |
| Resources offered by unrelated organizations | Nonfederated Organization (NFO) | No trust between organizations, lack of standards, local user accounts |

One of the important aspects of a solution which addresses such problems pertains to metadata (see Figure 1.2), which can be managed automatically or manually by the user. Such metadata includes user-specific information (e.g., access control) along with storage-specific information (e.g., the location of data replicas) [25; 9; 5; 6]. Metadata can also be used to describe the context of data access, e.g., system component load. The more contextual information is taken into account by management algorithms, the higher the quality provided to users. However, managing large amounts of metadata may prove difficult for many users if it is not handled automatically by the data access system. Moreover, as the environment grows in terms of round-trip times, metadata management becomes more challenging because of possible access and/or management overheads. Research indicates that operations on metadata are very likely to cause a bottleneck [13; 11; 15]. Thus, the quality of data access in a multiorganizational distributed environment is strictly related to metadata management.

Figure 1.2: Influence of metadata on data access
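To make the above distinction concrete, the following minimal Python sketch models a single file's metadata record holding the three kinds of information discussed above; the field names and the replica-selection heuristic are illustrative assumptions, not the actual schema used later in this thesis.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FileMetadata:
    """Illustrative metadata record (all names are hypothetical)."""
    # User-specific information, e.g. access control
    owner: str
    acl: Dict[str, str] = field(default_factory=dict)        # user -> "read"/"write"
    # Storage-specific information, e.g. location of data replicas
    replica_locations: List[str] = field(default_factory=list)
    # Contextual information, e.g. load of the components holding replicas
    component_load: Dict[str, float] = field(default_factory=dict)  # site -> load in [0, 1]

    def best_replica(self) -> str:
        """Pick the replica on the least loaded component: a toy example
        of a management algorithm exploiting contextual metadata."""
        return min(self.replica_locations,
                   key=lambda site: self.component_load.get(site, 0.0))

# Example: two replicas, one of them on a heavily loaded site
meta = FileMetadata(owner="alice",
                    acl={"bob": "read"},
                    replica_locations=["site-a", "site-b"],
                    component_load={"site-a": 0.9, "site-b": 0.2})
assert meta.best_replica() == "site-b"
```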

1.1 Motivation

When a user accesses data, several aspects are typically considered important – these include the type of storage system, the location of data and the state of the environment (e.g. the availability of storage space, its load and type, the number of users accessing storage, etc.). Minimizing the negative impact of data distribution, ensuring high availability of data, providing data replication and limiting delays in accessing replicas comprise another group of important issues. Dealing with such aspects is usually inconvenient for typical users. Hence, the topic of this dissertation is a study on provisioning distributed data in a transparent and effective manner.

As a result of the author's involvement in various national and international projects (see Chapter 1.4), strong demand for a system for accessing heterogeneous distributed data which

would be user-friendly, efficient and scalable has become evident. This observation has resulted in attempts to capture customer needs in terms of a general use case and the required functionality, which, according to systems engineering principles, expresses the most important factors at the early stage of development. Since – as previously mentioned – the development of such a system is a complex task, we formulate the following overall research question: at large scale, how should a system be built in order to fulfill the requirements of a multiorganizational community and offer user-friendly, efficient and scalable access to heterogeneous, distributed data resources, and what are its unique features? In addition to the above, we also state a more specific question: what are the main elements of such a system? In order to address these research questions, we specify the overall concept, architecture and core elements of the proposed system.

1.2 Thesis Statement and Research Objective

When analyzing data access in an organizationally distributed environment, we should mention elements which influence user operation, such as a consistent view of the distributed data, efficient data reading and/or writing capabilities, and avoiding or demanding data redundancy. Since many of these features contradict each other, according to the CAP theorem [3] it is impossible to select a single set of mutually non-exclusive features that satisfies all stakeholders. On the basis of the shortcomings of existing solutions (discussed in the Background Survey), along with our initial experimental environments and tests [138; 137], we have identified context awareness, represented by metadata management, as the main aspect of this study. Consequently, the thesis of the dissertation may be defined as follows:

In distributed storage infrastructures, context awareness enables data to be provisioned to users in a transparent manner while maintaining quality of access.

The above thesis includes three important terms:

• context awareness – the ability of entities to sense and react to the state of their environment [17]; in this work, context is identified with metadata which expresses knowledge of the circumstances of data access, e.g., the environment and the user's expectations,

• data provisioning transparency – a concept which stipulates that problems associated with differing data formats, storage systems and locations must be concealed from users; instead, users access data via logical paths while the infrastructure handles the underlying technical aspects (see the sketch below),

• quality of access (QoA) – Quality of Service (QoS) considerations related to data access [29].

The overall research objective is to develop an approach for transparent and easy provisioning of distributed data while maintaining quality of access.
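As a minimal illustration of the data provisioning transparency defined above, the sketch below resolves a logical path to a concrete storage backend; the catalogue contents and backend names are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical mapping maintained by the data access system:
# logical path -> (storage system, physical identifier)
catalogue: Dict[str, Tuple[str, str]] = {
    "/projects/brain/scan-001": ("s3", "bucket-eu/7f3a9c"),
    "/projects/brain/scan-002": ("lustre", "/mnt/scratch/u123/scan-002"),
}

def open_logical(path: str) -> str:
    """Resolve a logical path and dispatch to the right backend.
    The user never sees the storage system or the physical location."""
    backend, physical = catalogue[path]
    if backend == "s3":
        return f"GET object {physical} via the S3 API"
    if backend == "lustre":
        return f"POSIX open of {physical}"
    raise ValueError(f"unknown backend {backend}")

print(open_logical("/projects/brain/scan-001"))
```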

Transparent provisioning of distributed data not only results in access simplicity but also in broad management possibilities. Data can be automatically migrated and/or replicated in a transparent manner to decrease access latency or increase throughput. However, the concept of data access transparency does not in itself assume any goals regarding the data access system. Instead, transparent data management should implement a specific policy (follow particular guidelines – see Chapter 1.6), e.g., improving access throughput or providing additional security mechanisms. Moreover, different policies can be applied to different datasets or groups of users to address the above-mentioned problem of mutually exclusive features. QoS requirements can be defined in the form of a Service Level Agreement (SLA) [16] to formally describe aspects of quality. The agreement may also specify best-effort mechanisms along with the provisioning of characteristics that are difficult to measure (e.g., simplicity). The term QoA is used in this thesis to describe any attempt to deliver the desired access characteristics. Maintaining quality of access means that the provisioning of new or upgraded features does not result in the loss of any other desired characteristics. In particular, implementing a layer of abstraction that conceals data location from users (simplifying access) should not result in decreased access performance. While quality trade-offs are allowed, they should always relate to a specific policy (e.g., a loss of performance due to the use of safer but slower storage) rather than to implementation aspects of the data transparency mechanism itself.

To achieve the stated objective, the following research steps are performed:

• elaboration of a model which reflects the overall idea of the system,
• design of a system architecture that represents the model,
• implementation of core system elements,
• practical verification of the approach in real and simulated computing environments.

To validate the thesis and complete the main objective, the Model of Transparent Data Access with Context Awareness (MACAS) is created. In addition to modelling easy and transparent data access, it also enables the implementation of different policies, e.g., focusing on data access efficiency and scalability or on security (see Figure 1.3). To achieve these goals, appropriate knowledge about several aspects of data access (the data access context) is modelled as a set of metadata, hence metadata definitions constitute an important element of MACAS. The MACAS model introduces various abstractions which enable implementation at the different levels of data storage, access and management (see Table 1.1). Practical verification focuses on federated multiorganizational environments as the core functionality of the model, disregarding issues of trust which typically arise among NFOs.

1.3 Thesis Contribution

The work performed in this dissertation is based on collaborative research projects. In Figure 1.4 the author's contribution is highlighted in green, while black text represents collaborative work. The presented work is aligned with the trend of Software Defined Storage (SDS) [99]. The author's involvement is as follows:

1. Contribution to developing the assumptions for the MACAS-compliant system – identifying data access stakeholders, the limitations of organizational distribution, and the functional and non-functional requirements to be included in the model.

Figure 1.3: Proposed evolution of data access model

Figure 1.4: Contribution of the author (green) and collaborative tasks in which the author was involved (black)

2. Design of MACAS – a Model of Transparent Data Access with Context Awareness, including transparent data management assisted by the data access context. It models policy-based provisioning with hardware-independent data management and provides sufficient elasticity to account for dynamically adjusted features depending on the requirements of a particular user application.

3. Co-design of the data access system by mapping elements of MACAS to elements of the system architecture. The design takes into account efficiency, distributed management and a diversity of solutions and policies.

4. Implementation of the system core. Participation in the implementation of components representing elements of the system architecture.

5. Design of experiments and tests that verify the model based on popular benchmarks and tools. Participation in the implementation of tests.

The most important novelty of the author's contribution is context awareness, represented by metadata, included in the MACAS model to accommodate dynamic changes in the various factors that affect data access. The elaboration of metadata classes with different consistency and synchronization models, introduced to avoid context-information processing bottlenecks, should also be acknowledged. The implementation focuses on the ability to process large volumes of metadata as well as notifications, ultimately enabling the system to provision broad and up-to-date context information.

1.4 Note on Participation in Research Projects

The author has actively participated in several research projects related to distributed systems. The main background is provided by collaboration with the Academic Computer Centre Cyfronet AGH [62] and by postgraduate research at the Department of Computer Science of the Faculty of Computer Science, Electronics and Telecommunications of the AGH University of Science and Technology. Participation in the PL-Grid family of projects [92] has provided insight into data management in federated computing infrastructures. In the PL-Grid PLUS [91] and PL-Grid CORE [90] projects the author was responsible for a development team working on the implementation of tools simplifying access to, and management of, data stored within the PL-Grid infrastructure. The author was involved in the QStorMan project [28], developing a tool for the optimization of storage resource usage in accordance with user requirements. Work on the INDIGO [81] and EGI ENGAGE [72] projects involved the exploration of user requirements for organizationally distributed environments in the context of collaboration and data sharing. Basic experience was gained from the European Defence Agency EUSAS project [139; 123; 122; 117] and the national Rehab project [96; 118; 145], focusing on data farming methodology applications and on the holistic rehabilitation of stroke patients with the help of computer games.

1.5 Thesis Structure

The remainder of the thesis is organized as follows. Chapter 2 contains an overview of data access tools and environments important for the thesis. In Chapter 3, data access stakeholders and their requirements are identified; on the basis of this analysis, the Model of Transparent Data Access with Context Awareness is introduced. Selected aspects of the model implementation

are presented in Chapter 4, while Chapter 5 discusses its experimental evaluation. The final chapter outlines conclusions and plans for future work.

1.6 Definitions of Terms

The following terms are used in this thesis:

• Basic terms:
  – Dataset: a collection of data that may be perceived as a coherent whole.
  – Metadata: data that provides information about other data and/or the access context.
  – Data access: storing, retrieving and manipulating data.
  – Data access context: the environment/conditions in which data is accessed.
  – Data access context awareness: a property of a data access system that allows it to understand the environment/conditions in which it operates and to react to changes in them. The data access context is represented by metadata that describes knowledge concerning the circumstances of data access.
  – Data access and management policy: a set of guidelines that should be followed during data access and management to fulfill requirements.

• Elements of the environment:
  – Site: a set of closely linked computing and/or storage resources.
  – Client: a software entity used by the user to operate on data.
  – FUSE (Filesystem in Userspace): a software interface that allows non-privileged users to create their own filesystems without editing kernel code (a minimal sketch follows this list).
  – FUSE client: a client that is based on FUSE.

• Actors involved in data access:
  – Organization: an entity that has a collective goal and is linked to the external environment.
  – Provider: an organization that owns/operates computing and/or storage resources and provides them to the user. The provider's resources may form one or several sites. Data is stored and manipulated using software and hardware that belong to a particular provider.
  – User: a person or organization that possesses some data and/or performs computations.
  – Developer: a producer of applications/services that offer particular features to users.

• Terms used to describe relations between elements of the environment:
  – Federation: a group of computing or network providers who agree upon standards of operation in a collective fashion.
  – Nonfederated organizations (NFOs): organizations that do not have any cooperation agreement in force.

  – Support of a user by a provider: a term that describes the relation between the user and the provider when the provider makes its resources available for the user to store/access/process data. In such a case, the provider stores and manages metadata to provide data access that fulfills the user's requirements.
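The FUSE definition above can be made concrete with a minimal read-only filesystem. This sketch uses the third-party fusepy bindings (an assumption; the FClient discussed later in the thesis is a separate, far more elaborate implementation) and exposes a single file under the mount point.

```python
import errno
import stat
import sys

from fuse import FUSE, FuseOSError, Operations  # pip install fusepy (assumed)

class HelloFS(Operations):
    """Minimal read-only filesystem exposing a single file: /hello."""
    DATA = b"Hello from user space\n"

    def getattr(self, path, fh=None):
        if path == "/":
            return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2)
        if path == "/hello":
            return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1,
                        st_size=len(self.DATA))
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return [".", "..", "hello"]

    def read(self, path, size, offset, fh):
        return self.DATA[offset:offset + size]

if __name__ == "__main__":
    # Usage: python hellofs.py /tmp/mnt  (requires FUSE on the host)
    FUSE(HelloFS(), sys.argv[1], foreground=True, ro=True)
```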

2 Background Survey

According to the CAP theorem [3], it is impossible for a distributed data store to simultaneously provide more than two of the following three guarantees: consistency, availability and partition tolerance. When supporting Open Science, Big Data processing and cooperation among users who work with organizationally distributed resources, it is particularly important to provide availability and partition tolerance. This statement can be justified by a use case in which several scientists process different parts of a read-only dataset, e.g., they analyze different scans of the human brain. The analysis of each part of the dataset takes a long time and requires significant computational resources, hence it is conducted at several computing centers. While processing a particular portion of the dataset does not require access to its full contents, outcomes should be made available to everyone to facilitate further comparison of results obtained for different parts of the dataset. As the processing of each part can take days or weeks, it is critical to allow the processing of selected parts even when other parts are temporarily unavailable or the connection to other participating computing centers has been lost. For such a use case it is more important to ultimately provide access to results produced using multiple parts than to ensure a consistent view of temporary data during processing. Thus, the author focuses on tools that allow efficient data access at any given time, even at the expense of consistency. Tools such as GlobalFS [48], which aim to ensure strongly consistent filesystem operations in the event of node failures but come at the cost of reduced availability, are not considered.

This chapter begins with an overview of computational environments, followed by a discussion of typical data access tools. Subsequently, other solutions for data access are investigated. Finally, problems related to data access and management in organizationally distributed environments are summarized.

2.1 Computational Environments

Since scientific experiments often require substantial computing power, the author focuses on the two most popular large-scale approaches: grid and cloud computing (also referred to as the grid and the cloud, respectively). The aim of both is resource sharing, but the way in which the resources are offered differs.

2.1.1 Grid Computing

The computing grid is “a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities” [2]. The essence

of the grid is (1) the coordination of resources that are not subject to centralized control, (2) the use of standard, open, general-purpose protocols and interfaces, and (3) the provision of nontrivial quality of service [4]. Other definitions emphasize the use of virtualization to present a unified system image [108]. Two aspects important for this thesis appear in the above-mentioned definitions: linking resources contributed by several organizations and achieving a unified system image.

From the user's point of view, the grid can be perceived as a coherent whole due to its dedicated middleware and single sign-on features. However, sites have local, independently managed storage systems and may differ with regard to data access policies. The existence of multiple storage systems inside the grid, each managed by a different organization, makes data access and management difficult for users as well as for providers. Moreover, most storage solutions are suited only to a limited number of use cases, and the user is often forced to combine several tools to achieve the desired effect (see Chapter 2.2). Thus, there is room for solutions simplifying data access and management, although grid solutions are currently losing their importance.

2.1.2 Cloud Computing

According to the definition by NIST (National Institute of Standards and Technology, U.S. Department of Commerce), “cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction” [104]. Cloud computing is based on the virtualization of resources, which can be monitored, controlled and subjected to accounting. There are three basic cloud service models:

• Software as a Service (SaaS),
• Platform as a Service (PaaS),
• Infrastructure as a Service (IaaS).

SaaS offers end-user applications running in the cloud that are accessible from various devices; examples include Gmail and Google Docs. When using PaaS, in contrast to SaaS, the user has control over the deployed applications and may configure the application hosting environment, which provides greater elasticity than SaaS; examples include Google App Engine and Microsoft Azure. IaaS (e.g., Amazon Cloud) allows users to run arbitrary software with limited control of networking components (e.g., host firewalls), whereas in SaaS and PaaS users have no control over the network. IaaS offers the greatest elasticity but also requires substantial knowledge, including familiarity with system administration tasks.

The two basic cloud deployment models are private and public clouds; however, two additional models can be found in the literature: community and hybrid clouds [104]. A private cloud is an infrastructure dedicated to exclusive use by a single organization, whereas a public cloud is provisioned for open use by the general public. The latter is often owned, managed and operated by a business, and users are billed for its use. Both private and public clouds can be managed by a third party – the type of deployment depends on the user, not the managing entity. If a cloud infrastructure is provisioned for exclusive use by a specific community of consumers (more than a single organization, but without open access), it is called a community cloud.

Community clouds are often owned and managed by organizations that have shared concerns (e.g., mission, security requirements, policy, and compliance considerations). A hybrid cloud is a composition of two or more of the above-mentioned infrastructures (private, community, or public) that remain unique entities but are bound together by standardized or proprietary data and application technology.

Although many cloud environments provide data access transparency, this topic is still addressed by the thesis. The main assumption and motivation for the work is the diversity of clouds and their users. Different scientists use different clouds, and it is often impossible to process the data of all cooperating scientists using a single cloud due to limited resources, dataset migration overhead or formal reasons (e.g., funding obtained for the usage of a particular environment). Thus, aspects of cloud users' collaboration remain an interesting research topic and are addressed by the presented work.

2.2 Typical Grid and Cloud Data Access Tools

Grid and cloud providers offer storage systems for various purposes. The grid usually supports such solutions as (1) scratch storage for intermediate job results and data processing, and (2) long-term data storage for final job results, often accessible through a dedicated API and appropriate for sharing data between different sites. Providers can also offer (3) object storage that manages data as objects (e.g., Amazon S3). Access to objects is fast and scalable, but there is no hierarchical structure or block access such as in traditional filesystems. Object storage is designed to deliver multiple online storage services, whereas traditional storage systems are primarily designed for high-performance computing and transaction processing. Selected examples of tools used in grid and cloud environments are outlined below.

Lustre [83] is a parallel distributed filesystem for computational clusters, often used as a high-performance scratch system in the grid. In such cases, there are usually different Lustre instances on different sites, which means that data stored in this filesystem can only be shared within the local cluster. Although the efficiency of the Lustre system is high, it may nevertheless be improved through dedicated tools such as QStorMan [28; 32; 120]. QStorMan aims at delivering storage QoS and resource usage optimization for applications that use the Lustre filesystem: it continuously monitors Lustre nodes and dynamically forwards data access requests to storage resources according to predefined storage QoS requirements. The usage of QStorMan has been shown to improve the data access efficiency of PL-Grid data-intensive applications [120]. Another tool often used as a high-performance scratch system is GPFS [80; 46], a technology provided by IBM that offers usage characteristics similar to Lustre.

In order to make data available outside a site, it should be copied to permanent storage outside the local cluster, e.g., to LFC (LCG File Catalog), which is a storage mechanism for metadata management that provides common filesystem functionality for distributed storage resources [8; 22]. It supports file replication for better data protection and availability. It is commonly used via command-line utilities [109]. Direct access from the application source code is possible using the GFAL API [77].
Since many users consider the usage of dedicated command-line utilities or APIs a drawback, they may use a FUSE-based [75; 54] implementation of the filesystem called GFAL-FS [77], which provides access to data in the same manner as a regular Unix-like filesystem. However, data access via GFAL-FS is slower (compared to the command-line utilities or the GFAL API) and only available in read-only mode.
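For illustration, direct GFAL access from application code might look as follows, assuming the gfal2 Python bindings are installed; the endpoint URL is purely illustrative, since real grid URLs depend on the site configuration.

```python
import gfal2  # Python bindings for GFAL2 (assumed installed)

# Illustrative HTTP/SRM-style endpoint; real URLs depend on the site setup
URL = "https://storage.example.org/dpm/example.org/home/vo/dataset/file.dat"

ctx = gfal2.creat_context()      # note: the API name really is 'creat_context'
info = ctx.stat(URL)             # POSIX-like stat over the grid protocol
print("size:", info.st_size)

f = ctx.open(URL, "r")           # read remote data directly from code
header = f.read(1024)
```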

OpenStack Object Store (known as Swift [88]) is an example of object storage often used in the cloud. It is able to provide common file names within grid and cloud infrastructures, and can therefore be applied in similar use cases to LFC. However, the Swift file sharing mechanism (which makes use of API access key sharing or session tokens) seems to be more troublesome for most users than LFC file sharing mechanisms based on Unix permissions.

There are several reasons behind the high heterogeneity of storage systems:

• user requirements for storage resources with different characteristics depending on the application,
• local resource policies,
• the use of spare storage resources which already exist at the given location.

Users often emphasize the importance of specific use cases, such as archiving and efficient access to temporary files [132]. Currently, such use cases are served by different storage systems and tools on different sites. As a result, standard grid and cloud environments do not offer data access transparency when deployed in a multi-organization environment due to the heterogeneity of storage systems. The lack of easy, transparent data access results in management problems from the provider's point of view. Less technically advanced users often work only with scratch storage and manually manage data transfers using SSH-based protocols for both file sharing and staging prior to job execution. This results in suboptimal use of storage and computing resources. Thus, both users and providers require better data access methods.

2.3 Tools for Anytime/Anyplace Data Access

Although anytime/anyplace data access is a term used mainly in marketing, in the author's opinion it accurately reflects the focus of the tools described in this chapter. This section provides a description of solutions whose primary objective is to provision data to the user regardless of the access point. When the user works with resources on multiple sites, all data must be available for each user process on each site. Existing tools for anytime/anyplace data access focus on ease of access. The most popular ones are Dropbox, OneDrive and Google Drive [71]. Client applications are provided for the most popular operating systems, enabling a virtual filesystem to be mounted in order to transparently handle synchronization with cloud storage. If any operation performed without connection to the Internet conflicts with server-side changes performed by other clients, users can resolve the conflict on their own. Other significant features include file sharing mechanisms which allow users to easily publish their data. These tools impose strict limits on storage size and transfer speed, which become an obstacle when research is conducted in a geographically distributed manner and data requires online synchronization across sites. The user has to carefully plan data processing depending on where the data has been generated and what transfer/synchronization operations are foreseen.

Another similar sync-and-share tool is ownCloud [34; 41]. It enables users to maintain full control over data location and transfer while hiding the underlying storage infrastructure, abstracting file storage available through directory structures or WebDAV. It also provides file synchronization between various operating systems, sharing of files using public URLs, and support for external cloud storage services.
Although ownCloud is more flexible than the previously mentioned tools, its performance is likewise insufficient for data-intensive applications.
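Because ownCloud exposes files over WebDAV, plain HTTP clients can reach them without any dedicated SDK. The sketch below is a minimal example assuming a server at a hypothetical address and the typical remote.php/webdav endpoint; credentials and paths are illustrative.

```python
import requests  # standard HTTP client; ownCloud exposes files over WebDAV

BASE = "https://owncloud.example.org/remote.php/webdav"  # typical endpoint (assumed)
AUTH = ("alice", "app-password")

# Download a file exactly as if it were an ordinary WebDAV resource
r = requests.get(f"{BASE}/projects/results.csv", auth=AUTH)
r.raise_for_status()
with open("results.csv", "wb") as out:
    out.write(r.content)

# List a directory with the WebDAV PROPFIND method (Depth: 1 = direct children)
listing = requests.request("PROPFIND", f"{BASE}/projects",
                           auth=AUTH, headers={"Depth": "1"})
print(listing.status_code)  # 207 Multi-Status on success
```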

2.4 Tools for Distributed Data Processing

One of the most prominent tools for remote data access is Globus Connect [45]. It is built on the GridFTP protocol to provide fast data transfer and sharing capabilities inside an organization. Globus Connect focuses on data transfer and does not abstract access to existing data resources; thus, it does not provide any data access transparency.

Another option for distributed environments is to provision storage resources through a high-performance parallel filesystem. Solutions of this type aim to provide access to storage resources optimized for performance. They are usually built on top of dedicated storage resources (e.g., RAIDs) and expose POSIX-compliant interfaces. Examples include BeeGFS (formerly FhGFS) [65], GlusterFS [76], Coda [103] and PanFS [89]. As there are significant differences between these systems in terms of data access, their most important features are presented below.

BeeGFS [65] is an excellent example of a high-performance parallel filesystem because it uses many mechanisms typical of this type of tool. It combines multiple storage servers to provide a shared network storage resource with striped file contents. Built on scalable multithreaded core components with native InfiniBand support, BeeGFS has no architectural bottlenecks. It distributes file contents across multiple storage servers and likewise distributes filesystem metadata across multiple metadata servers. This results in high availability and low metadata access latency. Even given the multitude of metadata servers, it is guaranteed that changes to a file or directory made by one client are immediately visible to other clients. BeeGFS has no support for the integration of resources managed by several independent organizations: it can be used within a local storage area, but cannot easily provide transparent data access for organizationally distributed environments.

GlusterFS constitutes an interesting alternative to metadata-server-based designs. It uses an elastic hashing algorithm that allows each node to access data without the use of metadata or location servers. Storage system nodes have the intelligence to locate any piece of data without looking it up in an index or querying another server. This parallelizes data access and ensures good performance scaling. GlusterFS can scale up to petabytes of storage, available to the user under a single mount point. The developers of GlusterFS [76] point to the use of an elastic hashing algorithm as the heart of Gluster's fundamental advantages, resulting in good performance, availability and stability, and a reduced risk of data loss, corruption or unavailability. The use of hashing algorithms minimizes traffic flow; however, the usage of metadata servers results in better elasticity and easier reconfiguration. On the other hand, hashing algorithms require more work when a group of data servers is reconfigured, so both solutions have pros and cons; the choice should depend on the use case. Similarly to BeeGFS, GlusterFS is dedicated to local storage infrastructures, with no strong support for an organizationally distributed environment.
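GlusterFS's actual elastic hashing is not reproduced here; the sketch below illustrates the general idea with a plain consistent-hash ring, in which every client derives the same placement from the file path alone, so no metadata or location server is ever consulted.

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring: any node can locate a file from its name
    alone, with no metadata or location server (the general idea behind
    elastic hashing; GlusterFS's real algorithm differs in detail)."""

    def __init__(self, servers, vnodes=64):
        # Place several virtual points per server on the ring for balance
        self.ring = sorted(
            (self._h(f"{s}#{i}"), s) for s in servers for i in range(vnodes)
        )
        self.keys = [k for k, _ in self.ring]

    @staticmethod
    def _h(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def locate(self, path: str) -> str:
        # First ring point clockwise from the path's hash owns the file
        i = bisect.bisect(self.keys, self._h(path)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["server-1", "server-2", "server-3"])
print(ring.locate("/data/experiment/run-42.dat"))  # same answer on every client
```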
Coda [103] is an example of a system with strict support for disconnected-mode operations. It offers high availability of files by replicating a file volume across many servers and caching files on the client machine. The server and client communicate through Remote Procedure Calls (RPC). When a server is notified of file updates, it instructs clients that cache copies of the affected files to invalidate these copies. Due to client-side replication of data, the user is able to continue working in case of a network failure. However, aggressive caching may lead to conflicts. Automatic conflict resolution may involve data loss (of which the user may be unaware),

while manual conflict resolution is inconvenient for users. The main drawback of Coda is its lack of support for the organizational distribution of an environment (its cache coherency algorithm would result in high utilization of the network between sites [74]).

Another interesting tool is PanFS [89]. This solution creates a single high-performance pool of storage under a global namespace. While most storage systems loosely couple the parallel filesystem software with legacy block storage arrays, PanFS combines the functions of a parallel filesystem, a volume manager and a RAID engine into one holistic platform. It also makes efficient use of SSD storage to improve performance. It is a commercial tool dedicated to building high-performance storage solutions for a single organization; thus, similarly to the solutions mentioned above, it is not appropriate for data access in organizationally distributed environments.

It is also important to mention solutions built on top of object storage systems, such as CephFS [11; 55; 58], DDN's WOS [107] and Scality RING [98]. While CephFS provides a POSIX-compliant distributed filesystem based upon RADOS [95], WOS delivers only true object storage with no underlying filesystem. The architecture of WOS consists of three components: building blocks, the WOS Core software, and a selection of simple interfaces. WOS Core has self-healing capabilities and a single management console for the entire infrastructure. The distribution of data in WOS may be configured to use geographic replication for disaster protection. Scality RING is a software-only storage solution built using a distributed shared-nothing architecture. A distinguishing feature of this tool is its built-in tiering, which provides high flexibility in storage configuration. Unfortunately, none of these tools offer transparent data access when deployed over distributed resources managed by several NFOs.

The Hadoop Distributed File System (HDFS) [21; 52] offers support for the map-reduce paradigm, which is another direction in effective distributed data processing. It is designed to stream large datasets at high bandwidth to the user's processes. Data and metadata are stored separately. HDFS benefits from the replication of data among the available storage resources, but the metadata server can be considered a single point of failure that decreases the fault tolerance of the system. Another interesting tool that supports the map-reduce paradigm is Tachyon [102], which provides high performance for map-reduce applications by aggressively using memory. Neither HDFS nor Tachyon envision transparent data access in a federation; although HDFS supports federations, the purpose of an HDFS federation is to improve scalability and isolation rather than to hide data distribution from the user.

The above-mentioned tools differ in terms of implementation details, which also affects their non-functional features. GlusterFS uses an elastic hashing algorithm, while CephFS uses a metadata server and Scality RING utilizes a routing-based algorithm within a P2P network. The presented systems take different approaches to offline work and caching. Moreover, some of them require dedicated hardware, increasing efficiency at the cost of additional investments. Despite their variety, the presented tools are not well suited for organizationally distributed environments due to their limited support for provider cooperation.
This is caused by their centrally managed storage systems, able to provide transparent data access only inside a single organization. Moreover, most of the above-mentioned tools are also difficult to use because of limited support for deployment on resources where some data is already stored..
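To make the contrast between these designs concrete, the sketch below shows hash-based file placement in the spirit of GlusterFS's elastic hashing, where the server holding a file is computed from the file path alone, with no metadata server on the data path. The code is the author's illustrative assumption, not GlusterFS's actual algorithm.

```python
import hashlib

# Illustrative hash-based placement: the server responsible for a file is
# derived from its path, so clients need no central metadata server lookup.
# Server names are hypothetical.

SERVERS = ["storage-1", "storage-2", "storage-3", "storage-4"]

def locate(path: str) -> str:
    digest = hashlib.sha1(path.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

print(locate("/experiments/run-42/output.dat"))
print(locate("/experiments/run-42/input.dat"))
```

Note that this naive modulo scheme would relocate most files whenever a server is added or removed; production systems therefore hash into stable ranges or use consistent hashing, which is one reason the implementation details listed above matter for non-functional features.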

2.5. Tools for Unified View of Multiorganizational Data

Another type of solution for exposing storage resources comprises systems that provide a layer of abstraction on top of storage resources across multiple organizations. They can provide a consistent view of user data stored in different systems. They expose a single namespace for data storage and often facilitate data management by enabling providers to define data management rules.

An exemplary tool is iRODS [19; 33], developed for grid environments. Under iRODS, data can be stored in designated folders on any number of data servers. To integrate various external data management systems, such as GridFTP-enabled systems, SRM-compatible systems or Amazon's S3 service, a plug-in mechanism can be used. Data integration in the iRODS system is based on a metadata catalogue, iCAT, which is involved in the processing of the majority of data access requests. Metadata describing the actual data stored in the system includes information such as file name, size and location, along with user-defined parameters. The user can search for data that has been tagged, while the administrator of the system can query the metadata catalogue directly, using an SQL-like language, to obtain aggregated information about the system. Although iCAT provides a rich set of features for both users and administrators, it is also a weakness of the iRODS system due to its reliance on a relational database, which is potentially a systemic bottleneck and a single point of failure.

Due to iRODS's ability to adjust data management to provider/user needs, it is often referred to as adaptive middleware. To allow for dynamic adaptation, the iRODS system uses rules. A rule is a chain of activities provided by low-level modules (built-in or supplied externally) that together implement the required functionality. User actions are monitored by the rule engine, which can activate specific rules. Typical user interfaces are available: a POSIX interface on any FUSE-compatible Unix system [75] is provided by iRODS through a FUSE-based filesystem.

The iRODS system provides built-in support for federation through its Zone mechanism. A Zone is a single iRODS installation. Each iRODS Zone is a separate entity that can be independently managed and can include different storage resources. To access data located in another Zone, dedicated user accounts can be created in the remote Zone with a pointer to the home Zone of the user. Although iRODS is a powerful and flexible tool that allows for connecting organizationally distributed installations, it also has some drawbacks. For example, it does not provide location transparency for data stored across multiple federated iRODS installations. Users manage data locations by themselves, so data access transparency is not maintained.

Parrot [10] allows existing programs to be attached to remote data management systems which expose other access protocols (e.g., HTTP, FTP or XRootD) through the filesystem interface. Parrot utilizes the ptrace debugging interface to trap the system calls of a program and replace them with remote I/O operations. As a result, remote data can be accessed in the same way as local files. Unfortunately, the performance of Parrot is limited, as ptrace can generate significant overhead [10]. As a result, Parrot is not well suited for data-intensive applications.
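Stepping back to the iRODS rule subsystem described above, its event-driven character can be summarized with the following minimal sketch. It is written in plain Python rather than the actual iRODS rule language, and the hook names and actions are hypothetical.

```python
# Toy event-driven rule engine in the spirit of iRODS rules: provider-defined
# rules are chains of actions fired when a monitored user event occurs.
# Hook names and actions are hypothetical, not real iRODS identifiers.

RULES = {}  # event name -> list of registered actions

def rule(event):
    def register(action):
        RULES.setdefault(event, []).append(action)
        return action
    return register

@rule("post_put")
def replicate_large_files(ctx):
    # Chain of activities: check a condition, then invoke low-level actions.
    if ctx["size"] > 1_000_000:
        print(f"replicating {ctx['path']} to an archive resource")

@rule("post_put")
def index_metadata(ctx):
    print(f"adding {ctx['path']} to the metadata catalogue")

def fire(event, **ctx):
    for action in RULES.get(event, []):
        action(ctx)

# A user's "put" operation triggers every rule registered for that event.
fire("post_put", path="/zone/home/user/results.dat", size=5_000_000)
```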
Other interesting solutions include Syndicate Drive [101] and Storj [100]. Syndicate Drive is a virtual cloud storage system that combines the advantages of local storage, cloud storage, commodity CDNs, and network caches. Storj is a peer-to-peer cloud storage network implementing end-to-end encryption to allow users to transfer and share data without support from a third-party data provider. Although these solutions contain algorithms that speed up data access, they are rarely used in common computing infrastructures. The requirement to download data prior to its use is a drawback, since such preparation should be performed in the background.
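A minimal sketch of the client-side (end-to-end) encryption idea behind Storj follows, using the `cryptography` package as an assumed dependency; the key handling is deliberately simplified and does not reflect Storj's actual protocol.

```python
# Sketch of end-to-end encryption: data is encrypted on the client before
# upload, so storage nodes never see plaintext. Requires the "cryptography"
# package; key management is simplified for illustration only.
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # stays with the data owner, never uploaded
cipher = Fernet(key)

plaintext = b"sensitive experiment results"
ciphertext = cipher.encrypt(plaintext)   # this is what storage peers receive

# Sharing: the owner hands the key to a collaborator out of band; the
# storage network itself cannot decrypt the objects it stores.
assert Fernet(key).decrypt(ciphertext) == plaintext
```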

Another system worth mentioning is FAX (Federating ATLAS storage systems using XRootD) [40]. FAX brings Tier 1, Tier 2 and Tier 3 storage resources together into a common namespace that is accessible from anywhere. FAX client software tools (e.g., ROOT, xrdcp) are able to reach data regardless of its location. However, the N2N component, which is responsible for mapping global names to physical file names, may be a performance bottleneck because of its reliance on LFC.

To enable discovery and identification of datasets, open access services such as DataCite [68; 50] or OpenAIRE [31; 51] can be used. These services rely on standards such as OAI-PMH [7] for integration with existing platforms for publication metadata harvesting, and identify datasets through globally unique handles such as DOIs [27] or PIDs. However, these services do not directly address the issue of accessing the underlying data by end users.

In this context, it is also worth mentioning National Data Storage (NDS) [93; 20], a national initiative and pilot implementation of a distributed data storage system intended to provide high-quality backup, archiving and data access services. It introduces several useful features, such as advanced monitoring and prediction, as well as replication techniques that increase the availability and performance of data access. However, it lacks ease of deployment and scalability.

The tools described in this chapter are all oriented towards high-level data management. The standard POSIX filesystem interface is arguably preferable for most applications; for this reason, the effort undertaken to abstract any custom interface with a POSIX overlay is appreciated by users. However, the main goal of these systems is to enable uniform data access from anywhere, rather than to achieve high performance. Hence, despite the comfort of data access and management that they offer, their applicability to HPC application execution is limited.

2.6. Summary

This chapter summarizes existing data access solutions and identifies several interesting features, as listed in Table 2.1. The features offered by existing tools can be harnessed to meet the requirements of various user groups.

Table 2.1: Features of data access solutions

Feature | Tool example
Anytime/anyplace data access with location transparency | Dropbox
Increased storage system efficiency for chosen applications on demand | QStorMan
Increased efficiency of data access due to use of dedicated hardware | PanFS
Efficient work with many clients owing to replication of components | BeeGFS
Fast and reliable data transfer between sites thanks to efficient protocols and transfer supervision | Globus Connect
Geographic replication that provides disaster protection and reduces the risk of data loss | WOS
Stable and efficient operation over slow or unreliable networks through client-side caching and strict support for disconnected mode | Coda
High flexibility in storage configuration due to built-in tiering | Scality Ring
Dynamic adaptation of data management behaviour to provider/user needs due to a rules subsystem | iRODS
Support for distributed data management in federations through the Zone mechanism | iRODS
Unified view over independent data sources through the ability to attach remote data management systems | Parrot
Integration with various storage systems through plug-ins | iRODS
Discovery and identification of datasets | DataCite
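As a concrete illustration of the last feature in the table (dataset discovery), the sketch below issues a standard OAI-PMH `ListRecords` request of the kind used to harvest metadata from services such as DataCite; the endpoint URL is illustrative and the `requests` package is an assumed dependency.

```python
# Sketch of OAI-PMH metadata harvesting. The verb and parameters are part of
# the OAI-PMH standard; the endpoint URL is an illustrative placeholder.
import requests

ENDPOINT = "https://oai.example.org/oai"  # assumed OAI-PMH provider endpoint

response = requests.get(ENDPOINT, params={
    "verb": "ListRecords",         # standard OAI-PMH verb
    "metadataPrefix": "oai_dc",    # Dublin Core, supported by all providers
})
print(response.text[:500])         # XML with records and a resumptionToken
```

Harvesting then continues by re-issuing the request with the `resumptionToken` returned in the response, which is the standard OAI-PMH pagination mechanism.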
However, all of these solutions have drawbacks (see Table 2.2 and Figure 2.1). In particular, none of the analyzed solutions supports all of the following features:

• transparent anytime/anyplace data access,
• high efficiency and scalability,
• distributed (decentralized) data management.

Figure 2.1: Drawbacks of existing tools for federated computational environments (red: drawbacks, green: advantages offset by other drawbacks)

To the best of the author's knowledge, none of the existing services or tools combines all three of the listed elements. Limitations of existing tools have led to various extensions, e.g. IBM AFM (Active File Management) [63] for GPFS or xCache [82] for XRootD. Both AFM and xCache provide additional caching that improves the quality of work in a federation. However, such an improvement is only possible when all providers agree to use a common storage solution (GPFS or XRootD) across all sites, which is often impractical. Furthermore, existing initiatives also introduce certain drawbacks. For example, NDS [93] lacks ease of deployment and scalability, while the DataNet Federation Consortium (DFC) [69] is based on iRODS and therefore exhibits the iRODS drawbacks described earlier in this chapter.
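The caching role played by AFM and xCache can be illustrated with a generic read-through cache; the sketch below is the author's simplified illustration and does not correspond to either product's implementation.

```python
# Generic read-through site cache, illustrating the role of AFM/xCache:
# a local site serves repeated reads from its own storage and contacts the
# remote federation site only on a miss. Purely illustrative.

class ReadThroughCache:
    def __init__(self, fetch_remote):
        self.fetch_remote = fetch_remote   # callable: path -> bytes
        self.local = {}                    # path -> bytes held at this site

    def read(self, path):
        if path not in self.local:         # miss: one WAN transfer, then reuse
            self.local[path] = self.fetch_remote(path)
        return self.local[path]

def remote_fetch(path):
    print(f"WAN transfer of {path}")       # expensive inter-site traffic
    return b"payload of " + path.encode()

cache = ReadThroughCache(remote_fetch)
cache.read("/atlas/dataset-1")   # triggers a WAN transfer
cache.read("/atlas/dataset-1")   # served locally, no WAN traffic
```

As noted above, the limitation of this approach in practice is organizational rather than technical: every provider must run the same underlying storage stack for the shared cache layer to work.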

Table 2.2: Existing solution characteristics

Solution type | Data access solution | Characteristics | Drawbacks
Common grid/cloud data access tools | Lustre, GPFS | High-performance cluster solutions | Use of several tools together is needed; no data access transparency
 | LFC, Swift | Provide common file names within a grid or cloud |
 | QStorMan | Improves data access performance on demand |
Tools for anytime/anyplace data access | Dropbox, OneDrive, Google Drive, ownCloud | Easy to use | Limits on storage size and transfer speed
Tools for fast data movement | Globus Connect | Fast data movement and data sharing capabilities based on GridFTP | Does not abstract access to data resources
High-performance parallel file systems | BeeGFS | High availability and performance due to scalable multithreaded core components with native InfiniBand support | Designed for use by a single organization; no support for organizationally distributed environments
 | GlusterFS | Scalability and performance due to the elastic hashing algorithm |
 | Coda | High availability due to strict support for disconnected-mode operation |
 | PanFS | High performance due to combining a parallel filesystem, volume manager and RAID engine in one holistic platform |
Solutions based on object storage | CephFS | POSIX-compliant distributed filesystem | No support for organizationally distributed environments
 | DDN's WOS | True object storage |
 | Scality Ring | Software-only storage solution with built-in tiering that provides high flexibility in storage configuration |
Tools for map-reduce | HDFS | Designed to stream large datasets at high bandwidth to user processes | No support for organizationally distributed environments
 | Tachyon | High performance for map-reduce applications by using memory aggressively |
Tools for unified view of multiorganizational data | iRODS | Integrates various external data management systems using a metadata catalogue (iCAT) | Performance and/or scalability not sufficiently high for data-intensive applications
 | Parrot | Attaches existing programs to remote data management systems through a filesystem interface using the ptrace debugging interface |
 | Syndicate Drive, Storj | Based on data download before usage |
 | FAX | Brings Tier 1, Tier 2 and Tier 3 storage resources together into a common namespace |
 | DataCite, OpenAIRE | Enable discovery and identification of datasets |
 | NDS | Provides high-quality backup, archiving and data access services |

3. MACAS - Model of Transparent Data Access with Context Awareness

This chapter defines the Model of Transparent Data Access with Context Awareness (MACAS), which enables transparent, easy, efficient and scalable data access supported by knowledge of the distributed environment (i.e., by the context). MACAS assumes that data is stored on sites managed by various providers (see Figure 3.1) and accessed by users through client software.

The model introduces a set of layers, each of which provides certain features to fulfill the requirements of a data access stakeholder, and cross-cutting concerns (later referred to simply as concerns) that describe aspects affecting many layers at once. Each layer makes use of different metadata which describes the context needed to fulfill its tasks. The model defines not only the metadata required at a particular layer, but also metadata consistency and synchronization in a distributed environment. Finally, MACAS defines an algorithm that shows how the data access system should use layers and concerns to provide functionality to the user. Thus, MACAS is defined by the following elements: metadata, layers and concerns, and the algorithm making use of the elements mentioned above.

Metadata consistency and synchronization models determine the guarantees of consistency, availability and partition tolerance (see the CAP theorem [3]) that can be provided by a system based on the MACAS model. Provider independence is a strong argument for considering availability and partition tolerance more important than consistency: data availability should not be limited by the state of resources that are not managed by the provider hosting the data. However, lack of consistency results in different data views on different sites, and data access cannot be considered transparent if it depends on the place of access. Thus, MACAS divides metadata into groups that are treated differently in order to balance these mutually exclusive guarantees.
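A minimal sketch of this division is shown below: metadata entries carry a synchronization class that decides whether an update must be acknowledged everywhere (ACID-like) or may propagate lazily (BASE-like). All names and policies are the author's illustrative assumptions, not part of the MACAS specification.

```python
# Sketch: metadata grouped into classes with different consistency models.
# "STRICT" entries are synchronously replicated to every site (consistency
# first); "EVENTUAL" entries are queued for lazy propagation (availability
# first). Names and policies are illustrative assumptions only.
from enum import Enum

class SyncClass(Enum):
    STRICT = "strict"        # ACID-like: all sites must acknowledge
    EVENTUAL = "eventual"    # BASE-like: propagate in the background

class MetadataStore:
    def __init__(self, site, peers):
        self.site, self.peers = site, peers
        self.entries = {}        # key -> value
        self.outbox = []         # updates awaiting lazy propagation

    def put(self, key, value, sync_class):
        self.entries[key] = value
        if sync_class is SyncClass.STRICT:
            # A remote-site failure here would abort the whole update.
            for peer in self.peers:
                peer.entries[key] = value
        else:
            self.outbox.append((key, value))   # sent when convenient

    def flush(self):
        for key, value in self.outbox:
            for peer in self.peers:
                peer.entries[key] = value
        self.outbox.clear()

site_b = MetadataStore("B", peers=[])
site_a = MetadataStore("A", peers=[site_b])
site_a.put("/f1#acl", "rw-user", SyncClass.STRICT)         # visible everywhere
site_a.put("/f1#atime", "2019-04-01", SyncClass.EVENTUAL)  # converges later
```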

To complete the model, three steps must be performed:

1. identification of data access stakeholders and analysis of their requirements,

2. identification of the metadata required to fulfill the stakeholders' requirements and definition of metadata classes with different consistency and synchronization models to avoid bottlenecks,

3. design of the MACAS layers and concerns, including the context and functionality of each MACAS layer and concern, along with its algorithmic representation.

The author performed the main tasks associated with steps 1-2, along with full personal involvement in step 3, i.e., in the development of the MACAS model.

Figure 3.1: Scheme of the data access system

Figure 3.2: Stakeholders' relation to data access in federated computational environments

3.1. Data Access Stakeholders

Three main classes of data access stakeholders can be identified: users, providers and developers. The user expects a set of specific features, while the provider tries to satisfy the user's requirements with the limited resources at their disposal (see Figure 3.2). The developer creates services that provide functionality for the user. These services represent IT platforms or tools which support use cases typical of a given scientific discipline. Thus, users and providers are the most important stakeholders; the developer can be perceived as an advanced user who requires additional functionality to integrate newly created services with the data access system.

The main issues from the users' perspective, encountered while dealing with Big Data characterized by volume, velocity, variety and value, are enumerated below:

1. easy anytime/anywhere data access,

2. easy data sharing,
