

Politechnika Warszawska / Warsaw University of Technology
http://repo.pw.edu.pl

Rodzaj dyplomu / Diploma type: Rozprawa doktorska / PhD thesis

Autor / Author: Tomaszuk Dominik

Tytuł / Title: Nowe metody dostępu do danych w strukturach grafowych RDF / New methods for accessing data in RDF graph structures

Rok powstania / Year of creation:

Promotor / Supervisor: Rybiński Henryk

Jednostka dyplomująca / Certifying unit: Wydział Elektroniki i Technik Informacyjnych / Faculty of Electronics and Information Technology

Adres publikacji w Repozytorium / Publication address in Repository: http://repo.pw.edu.pl/info/phd/WUT262927/


WARSAW UNIVERSITY OF TECHNOLOGY

Faculty of Electronics

and Information Technology

Ph.D. THESIS

Dominik Tomaszuk, M.Sc.

New methods for accessing data in RDF graph structures

Supervisor: Professor Henryk Rybiński, Ph.D., D.Sc.

Warsaw 2013


Abstract

The dissertation presents a novel approach to accessing data in RDF structures. The Semantic Web lacks trust and access control mechanisms. The purpose of the thesis is to solve the problems of storing, representing and accessing knowledge stored in the RDF form, expanded with new possibilities of expressing trust. The thesis introduces an RDF graph store and proposes semistructural documents together with a serialization, trust metrics, methods of subgraph selection and a means of handling agent identity.

Furthermore, a model of a system that utilizes the proposed algorithms and methods is presented. Finally, the work describes the implementation and evaluation of a prototype system, which combines all of the introduced mechanisms. Performance is evaluated using benchmark data [23]. In comparison with other state-of-the-art RDF graph stores, the proposed methods and algorithms provide superior or comparable performance, despite being designed to work with trust and security measures.

Keywords: Semantic Web, Linked Data, Resource Description Framework, Authentication, Authorization, Trust metric, Graph store, Data space, Data integration.

ACM Classification: C.2.4 Distributed Systems, D.3.2 Language Classifications, E.1 Data Structures, E.2 Data Storage Representations, F.4.1 Mathematical Logic, H.2.4 Systems, I.2.4 Knowledge Representation Formalisms and Methods, K.6.5 Security and Protection.


Streszczenie

Rozprawa prezentuje nowatorskie podejście do metod dostępu do danych w strukturach RDF. W Semantycznym Internecie (ang. Semantic Web) brak jest mechanizmów uwzględniających zaufanie i kontrolę dostępu. Celem pracy jest rozwiązanie problemów magazynowania, reprezentacji i dostępu do wiedzy w oparciu o język RDF poszerzony o nowe możliwości ekspresji zaufania. Praca przedstawia koncepcję magazynu grafów RDF oraz proponuje semistrukturalne dokumenty wraz z serializacją uwzględniającą metryki zaufania, metody wybierania podgrafów oraz możliwość obsługi tożsamości agentów.

Rozprawa prezentuje także model systemu wykorzystującego zaproponowane algorytmy i metody. Opisana jest również implementacja i ocena prototypowego systemu, który łączy w sobie wszystkie zaproponowane mechanizmy. Eksperymenty zostały przeprowadzone na podstawie danych wzorcowych [23]. W porównaniu do innych magazynów grafów RDF proponowane metody i algorytmy oferują rozwiązania o porównywalnej bądź też przewyższającej wydajności, pomimo faktu, że są zaprojektowane do pracy z miarami zaufania i bezpieczeństwa.

Słowa kluczowe: Semantyczny Internet, Łączenie danych, Resource Description Framework, Uwierzytelnianie, Autoryzacja, Metryka zaufania, Magazyn grafów, Przestrzeń danych, Integracja danych.


Acknowledgements

I would like to thank my supervisor, Henryk Rybiński, for his support and guidance over the years. Additionally, he waded through several preliminary versions of my thesis, and his insightful comments helped me to improve its structure significantly. Without his continuous belief in me, this work would not have been finished.

I am grateful to my family, particularly to my parents, for supporting me at every moment.

I would especially like to thank my fiancée, Justyna, for her endless understanding and patience. She has always been with me, for better and for worse. She never lost faith in me even when I lost faith in myself.


Contents

List of Figures
List of Tables
List of Algorithms
List of Listings

1. Introduction
   1.1. Semantic Web Context
      1.1.1. Resource Description Framework
      1.1.2. RDF Schema
      1.1.3. Web Ontology Language
      1.1.4. SPARQL Protocol and RDF Query Language
      1.1.5. Linked Data
   1.2. Background and Motivations
   1.3. Scope and Thesis
   1.4. Organization of the Work
2. Basic Concepts
   2.1. RDF Triples
   2.2. RDF Graph and RDF Graph Store
   2.3. Document-Oriented Databases and Query Language
   2.4. Semantic Web Environment
   2.5. Authentication, Authorization and Trust
3. Related Work
   3.1. Trust Metrics
   3.2. RDF Stores
   3.3. Serializations
   3.4. Matching Subgraph and Access to Store
   3.5. Authentication and Authorization Methods
4. Data Structures for RDF
   4.1. Representing Trust Metrics with Annotations
      4.1.1. Preliminaries
      4.1.2. Annotated RDF with Trust Metrics
      4.1.3. Querying Trust with Extended SPARQL
      4.1.4. Mapping into the RDF model
   4.2. Document-oriented Graph Store
      4.2.1. Data Collections in Document-oriented Graph Store
      4.2.2. Document Serialization
      4.2.3. Generating Documents
      4.2.4. Mapping into Named Graphs
5. Operations on RDF Graph Store
   5.1. Matching Subgraphs and Generating Views
      5.1.1. RESTful Management
      5.1.2. RESTful Querying
      5.1.3. Collection Views and Query Syntax
      5.1.4. Mapping from SPARQL
   5.2. Verifying Agent Identity
      5.2.1. Authentication
      5.2.2. Authorization
      5.2.3. Federated Identity
6. Experiments
   6.1. Load tests
   6.2. Performance tests
   6.3. Black-box tests
7. Conclusions
A. Appendix: Syntaxes
B. Appendix: System implementation
Bibliography

List of Figures

1.1 Semantic Web stack
1.2 Example of datasets in Linked Data
2.1 An RDF graph
2.2 A Named Graph
2.3 Document-oriented database architecture
3.1 Relationship between RDF serializations
3.2 Relationship between RDF query languages
4.1 Hierarchy of RDF semantics
4.2 Simple example of model explanation
4.3 Example of M-annotated interpretation
4.4 Example of using inference rules
4.5 Document-oriented graph store model
5.1 An AnAID authentication ontology
5.2 Sequence diagram of authentication
5.3 An AzAID authorization ontology
5.4 Sequence diagram of federated identity
6.1 Graph store layers
6.2 Load test: Testbed comparison to Virtuoso
6.3 Load test: Testbed comparison to Virtuoso and MySQL
6.4 Load test: normalized textual RDFJD comparison to binary RDFJD
6.5 Load test: Testbed with binary RDFJD comparison to Virtuoso and MySQL
6.6 Q1 performance test: Testbed and Virtuoso
6.7 Q2 performance test: Testbed and Virtuoso
6.8 Q3 performance test: Testbed and Virtuoso
6.9 Q4 performance test: Testbed and Virtuoso
6.10 Q5 performance test: Testbed and Virtuoso
6.11 Requirement hierarchy
6.12 Black-box testing structure
A.1 Major productions for query language
A.2 Major productions for documents
B.1 Simplified system ontology

List of Tables

4.1 Inference rules
4.2 RDFJD directive document keys
4.3 RDFJD statement document keys
5.1 Mapping graph store operations to HTTP methods
5.2 Special characters in RESTful queries
5.3 Query string parameters in collection views
5.4 Query string parameters in projecting
5.5 Relationship between HTTP status codes and methods
5.6 Relationship between update operation and SPARQL clauses
5.7 Identity rules
6.1 Load: dataset description
6.2 Performance: dataset description
6.3 Performance: queries description
6.4 Data result characteristics
6.5 Test cases

List of Algorithms

4.1 Mapping to reification statements
4.2 Mapping to Named Graphs
4.3 Generating a directive document
4.4 Generating a statement document
4.5 Merging statement documents
4.6 Mapping to Named Graphs
5.1 Mapping from SPARQL
5.2 Authorization process
5.3 Graph store identity

List of Listings

4.1 Extended SPARQL grammar
4.2 Simple Trust Ontology
5.1 Update Ontology
5.2 RSA public key testing query
5.3 hasAssistant property in AnAID ontology
5.4 hasAssociation property in AnAID ontology
6.1 Fragment of test manifest
6.2 Fragment of test results
B.1 Term interface
B.2 Literal interfaces
B.3 Resource interfaces
B.4 Quad interface
B.5 Collection interface
B.6 Identity interfaces

1. Introduction

1.1. Semantic Web Context

This section presents components of the Semantic Web [17, 97, 11], the aim of which is to increase machine support for the interpretation and integration of information on the World Wide Web (WWW).

The Semantic Web is an extension of the current web, in which information is given well-defined meaning. It is also decentralized and open, so that anyone can add and modify content. The main features of the Semantic Web are the decentralization of information and the openness of its structure, so that various “knowledge bases” can refer to each other.

According to Tim Berners-Lee [17] the Semantic Web is a web of data, in some ways like a global database and it provides a common framework that allows data to be shared and reused across applications. In [97] the Semantic Web is presented as a web of actionable information. This information is derived from data through a semantic theory for interpreting the symbols. The semantic theory provides an account of “meaning” in which the logical connection of terms establishes interoperability between systems.

The main goal of the Semantic Web is to integrate data from various sources, making it possible for the data to be processed by computers. Therefore, the current development of the Semantic Web is aimed at creating tools for integrating knowledge, publishing it on the Internet and presenting it in a structured and useful form. This approach also provides opportunities to store structures that are unrestricted by a schema.

Tim Berners-Lee also suggested that these assumptions depend on each other and were designed as a layered and functional architecture, which is presented in Figure 1.1.

This architecture is based on available standards: unique entity identification via the Internationalized Resource Identifier (IRI) [43], encoding of character symbols via Unicode, and formats for serialization [25, 34]. In the following sections we briefly describe the core layers of the architecture.

Figure 1.1: Semantic Web stack (layers include Unicode/IRI, syntax, data interchange: RDF, querying: SPARQL, taxonomies: RDFS, ontology: OWL, and applications)

1.1.1. Resource Description Framework

The Resource Description Framework (RDF) [38, 87] is used as a general method for the conceptual description or modeling of information available in web resources, using a variety of syntax formats. It provides the essential foundation and infrastructure to support the description and management of data. In other words, RDF is a very general data model for describing resources and the relationships between them.

The RDF data model is similar to class diagrams, as it is based upon the idea of making statements about web resources in the form of subject-predicate-object expressions. These expressions are known as triples in the RDF terminology. It also provides a number of extra capabilities, such as built-in properties for representing groups of resources, and capabilities for representing Extensible Markup Language (XML) fragments as property values.

1.1.2. RDF Schema

An RDF Schema (RDFS)1 [26] is a set of classes with certain properties using the RDF extensible knowledge representation language. It provides elements for the description of ontologies, otherwise called RDF vocabularies, intended to structure RDF resources. It is a collection of properties with clearly established references, intended for use in RDF graphs. In other words, it is intended to provide the primitives that are required to describe the vocabulary used in particular RDF models.

1 The RDF Working Group decided to rename RDF Schema to the RDF Vocabulary Description Language.

The purpose of RDFS is to serve as a language for defining an ontology with concepts, called classes, and instances called resources in the RDFS terminology. It uses classes and global domain and range restrictions for properties. It also provides the means to express hierarchies of these properties and classes by subsumption relationships. RDFS is used for defining taxonomies.
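The subsumption hierarchies that RDFS expresses can be illustrated with a small sketch. This is a toy fixed-point computation, not an RDFS reasoner; the class names below are illustrative assumptions.

```python
# Illustrative sketch: the transitive closure of rdfs:subClassOf over a set
# of RDF-like triples (the class names are hypothetical examples).
def subclass_closure(triples):
    """Return every (sub, super) pair entailed by rdfs:subClassOf transitivity."""
    edges = {(s, o) for s, p, o in triples if p == "rdfs:subClassOf"}
    closed = set(edges)
    while True:  # fixed point: keep joining pairs until nothing new appears
        new = {(a, d) for (a, b) in closed for (c, d) in closed if b == c}
        if new <= closed:
            return closed
        closed |= new

triples = [
    ("ex:Student", "rdfs:subClassOf", "foaf:Person"),
    ("foaf:Person", "rdfs:subClassOf", "foaf:Agent"),
]
print(subclass_closure(triples))
```

With the two triples above, the closure also contains (ex:Student, foaf:Agent), which is exactly the subsumption inference RDFS licenses.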

1.1.3. Web Ontology Language

The Web Ontology Language (OWL) [13, 96] is a knowledge representation language for authoring ontologies. OWL is also a semantic language for publishing and sharing ontologies. It supports more complex relationships between entities than RDFS.

Ontologies are a superset of taxonomies. They allow the specification of a network of information with logical relations, while taxonomies describe strictly hierarchical structures.

In the sense of the Semantic Web, an ontology consists of concepts, instances, relations, inheritance, axioms and rules. Each concept (called a class) can have many instances (called individuals in the OWL terminology). Concepts and instances may have relations with each other. A special kind of relation between classes is inheritance. A class can be a subclass of another class inheriting all its attributes. Axioms represent commonly accepted facts, while rules describe how information can be derived.

There are many works related to ontologies, such as: methods for the representation and processing of ontological knowledge [52], vocabularies for describing linked RDF datasets [10], or query answering for OWL-DL with rules [80]. In this work these problems are not considered.

1.1.4. SPARQL Protocol and RDF Query Language

Apart from knowledge representation languages, the Semantic Web also needs a query language. The SPARQL Protocol and RDF Query Language (SPARQL2) [61, 50] is an RDF query language able to retrieve and manipulate data stored in RDF databases and files.

2 SPARQL is a recursive acronym for SPARQL Protocol and RDF Query Language.

SPARQL is a graph-matching query language. A query forms a pattern, which is matched against the source. It offers required and optional graph patterns along with their conjunctions and disjunctions. SPARQL also supports aggregation, negation, creating values using expressions, subqueries, extensible value testing, and constraining queries by the source RDF graph. The results of such queries can be result sets or RDF graphs.

A central concept of SPARQL is the matching of the graph patterns, so called basic graph patterns (BGPs). A BGP is a set of triples, which become patterns by the introduction of a variable. SPARQL supports four kinds of queries, which can (1) return a set of bound variables in the form of result sets, (2) create and return RDF graphs, (3) return an RDF graph that describes the resources found and (4) check whether a query pattern has matches in the model graph or not. SPARQL 1.1 [50] also supports editing, adding, and removing triples as well as updating, creating, and removing RDF graphs.
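The four query forms can be sketched as follows; the prefix, data and patterns are illustrative assumptions, not queries from the thesis:

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# (1) SELECT — return a set of bound variables as a result set
SELECT ?name WHERE { ?person a foaf:Person ; foaf:name ?name . }

# (2) CONSTRUCT — create and return a new RDF graph
CONSTRUCT { ?person foaf:nick ?name } WHERE { ?person foaf:name ?name . }

# (3) DESCRIBE — return an RDF graph describing the matched resources
DESCRIBE ?person WHERE { ?person foaf:name "John Smith" . }

# (4) ASK — check whether the pattern has any match in the graph
ASK { ?person a foaf:Person . }
```

In each form, the WHERE clause is a basic graph pattern: a set of triples turned into a pattern by the variable ?person (and ?name).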

1.1.5. Linked Data

Figure 1.2: Example of datasets in Linked Data

Linked Data [22] is an idea that describes a method of publishing structured data. This data can be interlinked so that it becomes more useful. Linked Data builds upon standard World Wide Web technologies. It extends web pages intended for human readers so that the shared information can also be read automatically by machines. This enables data from different sources to be related and queried. It is linked to other external data sets and can, in turn, be linked to/from external data sets. In other words, Linked Data is a set of documents, each containing a representation of a linked data graph. Figure 1.2 presents a visualization of linked data sets in the cloud.

Tim Berners-Lee outlined four principles of linked data in [15], paraphrased as the following rules:

1. Use IRIs to identify things.

2. Use IRIs in standard protocols so that these things can be referred to and dereferenced by people and agents.

3. Provide useful information about the thing when its resources are dereferenced, using standards, such as RDF, RDFS, OWL and SPARQL.

4. Include links to other exposed data to add context and improve discovery of other related information on the World Wide Web.

By employing IRIs as names for resources, a well-known protocol as a retrieval mechanism, and the RDF data model to represent resource descriptions, Linked Data directly builds on the general architecture of the web. It is mainly about distributing structured data in RDF, in contrast to the fully-fledged Semantic Web, which focuses on the ontological level.

1.2. Background and Motivations

Knowledge representation is the way in which knowledge is presented in a language.

Natural language can be construed as one of the methods of knowledge representation.

The main knowledge-carrying unit in such languages is the sentence, which consists of words arranged according to grammatical rules [24]. Although natural languages can be classified by their characteristic word order, knowledge representation expressed through such languages is deemed difficult for machines to process, as it is irregular.

The RDF language, originating from the Semantic Web [17, 97, 11] and Linked Data [22] concepts and used for knowledge representation on the Internet, was a response to this problem. This language and the concepts from which it originates have enabled free data exchange, formalisation and unification of the knowledge stored so far. An RDF assumption is to describe resources by means of an expression consisting of three elements (the so-called RDF triple): subject, predicate and object [38]. RDF borrows heavily from natural languages. An RDF triple may then be seen as a sentence, with the subject corresponding to the subject of a sentence, the predicate corresponding to its verb and the object corresponding to its object. So the RDF language may be classified by means of the same syntactic criteria as natural languages. According to these premises, RDF belongs to the group of Subject Verb Object languages (SVO)3 [35]. The regularity of the RDF language makes for good knowledge representation that is processed easily by machines, and because its structure is similar to natural languages, it is also legible for people.

A collection of RDF statements (triples) establishes an RDF graph. Such an RDF graph is a labelled directed graph, and it may contain recurring edges. Subjects and objects are nodes, while predicates are labels in the graph. An RDF graph defined in such a manner is also called a default graph [38]. This model, when extended with graph provenance by adding a fourth element to the RDF triple, i.e. the context, is defined in the literature [31] as a Named Graph.
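The move from triples to context-carrying quads described above can be sketched in Python. This is a toy model under illustrative assumptions (the IRIs and graph names are invented), showing how the fourth element makes grouping by provenance possible.

```python
from collections import defaultdict

# Toy model: a Named Graph adds a fourth element (the context / graph name)
# to each RDF statement, so statements can be grouped by provenance.
quads = [
    ("ex:js", "rdf:type", "foaf:Person", "http://example.com/people"),
    ("ex:js", "foaf:name", "John Smith", "http://example.com/people"),
    ("ex:univ", "rdfs:label", "University", "http://example.com/orgs"),
]

graphs = defaultdict(set)      # context IRI -> set of plain triples
for s, p, o, c in quads:
    graphs[c].add((s, p, o))

print(sorted(graphs))          # the named graphs present in this toy store
```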

The knowledge represented in the RDF language may be stored in files and in databases. In both cases RDF documents must be written in a proper way. In the database approach, the concept of a graph store is usually used as a modifiable container of RDF graphs. Storing, representing and accessing RDF graph structures in graph stores are the key research problems in the Semantic Web and Linked Data area. The problems are caused by inefficient data representation and storage, and by the difficulty of searching for information in a decentralized Semantic Web environment. In addition, many properties of classical databases are lost, particularly security aspects such as agent authentication and authorization. The research addressed in this work was motivated by the lack of solutions concerning these issues.

Purpose-built Semantic Web databases have no explicit way to express intentions concerning the grouping and querying of information. The tools supporting storage of RDF triples do not support metrics of knowledge, including trust metrics, written in RDF, and it is not possible to store such metrics in RDF triples directly. Current research addresses the problem of agent trust and focuses on metrics of graph provenance.

In the thesis we attempt to define proper functions to equip RDF graphs with trust metrics, and we propose a new, compact query language to support this feature.

3 42% of natural languages (including English) belong to this group.


1.3. Scope and Thesis

In our lives, we deal with many different institutions, companies, schools, etc., in many different relationships, and each of them holds information about us. However, it is unlikely that any one institution has all the information about us. Each of them gives us access to the data on websites where we are uniquely identified by a login, an id, etc. We have to use their applications to interact with the information about us, and we have to use their identifiers for us. If we want to create a profile, we must use their custom computer programs to search the information, copy it, and manage it to suit our needs. It is important to have the possibility to group, browse, aggregate and reuse these data. Information about us comes from various sources, so it is also valuable for it to be annotated with a degree of trust. The information should also be secure.

The tools supporting storage of RDF data appeared at the beginning of the 21st century. However, there are numerous drawbacks to the proposed solutions. In particular, the main difficulties result from the following:

— There is no support for expressing trust in knowledge written in the RDF language, and it is not possible to store such trust metrics in RDF triples directly.

— There is no optimal data organisation structure; the existing solutions are also limited in regard to RDF graph encoding.

— There is no support for agent authentication and authorization in RDF graph stores; moreover, the existing proposals are incomplete and do not have formally defined semantics.

— There are no mechanisms to store, access and select a subgraph in compliance with the Linked Data principles; the existing proposals do not allow for generating views, needed in certain use cases.

— There is no possibility of grouping graphs, which makes it impossible to define graph provenance or trust metrics.

— There are performance problems related to data processing and data access, resulting from the non-normalisation of structures.

Referring to the RDF approach, our proposal addresses the issues of security, trust and linking data to each other, so that instead of maintaining various identifiers in different formats, we can essentially build a set of bookmarks to it all and mark the data with trust metrics. Access to the information requires protection, so we propose authentication and authorization mechanisms. The purpose of the thesis is to solve the problems of storing, representing and accessing knowledge stored in the RDF form, expanded with new possibilities of expressing trust. To handle these issues we propose:

1. The methods that take into account knowledge reliability, including trust annotations, trust metrics and inference rules.

2. The methods that enable better serialization than the currently known ones, and methods that enable more effective work with RDF statements in a document-oriented store and in collections.

3. The methods for handling subgraph selection, updating, data accessing and respective query interfaces, including graph store management and generating collection views.

4. The decentralized methods for agent identification, such as authentication, authorization and identity delegation.

The thesis of this dissertation is formulated as follows:

The proposed approach solves the following problems:

1. The proposed methods provide inference rules and trust metrics for annotated RDF statements, increasing the knowledge expressible in RDF.

2. The proposed methods improve storing and grouping RDF graph structures.

3. The proposed query language is more effective in manipulating, selecting and creating views.

4. The proposed mechanisms for access control and identification solve the issue of authentication and authorization in the environment of federated agents.

The proposal constitutes a complete solution for knowledge representation in the Semantic Web and Linked Data environment. Compared to other solutions, the proposed approach provides new possibilities for RDF processing without loss of efficiency.


1.4. Organization of the Work

The dissertation is composed of seven chapters. Chapter 2 introduces the definitions, methods, and structures connected with the Semantic Web. Chapter 3 contains an overview of the state of the art in the domain of the Semantic Web. Chapter 4 proposes data structures for RDF: trust annotations, semistructural document serialization and collections. Chapter 5 discusses matching subgraphs, generating collection views, graph store management, authentication and authorization. Chapter 6 introduces the architecture and key components of our proposal; moreover, it contains descriptions and results of the experiments. Chapter 7 concludes the thesis, outlining the main claims and results, and presents perspectives for future research.

The dissertation includes two appendices. Appendix A presents query language syntax, and semistructural document syntax. Appendix B shows the testbed implementation.


2. Basic Concepts

2.1. RDF Triples

An RDF triple consists of a subject, a predicate, and an object. The meaning of subject, predicate and object is explained in [38]. The subject denotes a resource; the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object. The predicate thus denotes a binary relation, also known as a property.

Following [38], we provide definitions of RDF triples below.

Definition 2.1 (RDF triple). Assume that I is the set of all Internationalized Resource Identifier (IRI) references, B an infinite set of blank nodes, LS the set of RDF plain literals without a language tag, LL the set of RDF plain literals with a language tag, and LD the set of all RDF typed literals. Let L = LS ∪ LL ∪ LD, O = I ∪ B ∪ L and S = I ∪ B; then T ⊆ S × I × O is the set of all RDF triples.

If t = ⟨s, p, o⟩ is an RDF triple, then s is the subject, p the predicate and o the object.

Example 2.1. In this example we present two RDF triples, which describe a person’s surname.

_:me rdf:type foaf:Person .
_:me foaf:surname "Smith" .
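Definition 2.1 can be mirrored in a small validity check. This is a sketch under illustrative assumptions: terms are encoded as tagged pairs standing in for membership in I, B and L.

```python
# Sketch of Definition 2.1: a triple ⟨s, p, o⟩ is well formed when
# s ∈ I ∪ B, p ∈ I and o ∈ I ∪ B ∪ L. Term kinds are encoded as tagged pairs.
def is_iri(term):     return term[0] == "iri"
def is_blank(term):   return term[0] == "blank"
def is_literal(term): return term[0] == "literal"

def is_rdf_triple(s, p, o):
    return (is_iri(s) or is_blank(s)) and is_iri(p) and \
           (is_iri(o) or is_blank(o) or is_literal(o))

me = ("blank", "_:me")
surname = ("iri", "foaf:surname")
smith = ("literal", "Smith")

print(is_rdf_triple(me, surname, smith))   # a valid triple, as in Example 2.1
print(is_rdf_triple(smith, surname, me))   # literals cannot appear as subjects
```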

Definition 2.2 (Simple interpretation). Let V1 be the RDF vocabulary, V2 the RDFS vocabulary, and V3 the OWL vocabulary; then V = V1 ∪ V2 ∪ V3 is a vocabulary that includes the RDF, RDFS and OWL vocabularies. A simple interpretation over vocabulary V is N = ⟨RN, PN, EXTN, SN, LN, LVN⟩, where:

1. RN is a nonempty set of named resources (the universe of N), RN ⊇ I ∪ LS ∪ LL ∪ LD,
2. PN is a set whose elements are named properties, PN ⊆ RN,
3. EXTN is an extension function used to associate properties with their property extensions, EXTN : PN → 2^(RN × RN),
4. SN is a function used to assign IRIs to resources or properties, SN : I → RN ∪ PN,
5. LN is a function used to assign typed literals to resources, LN : LD → RN,
6. LVN is a subset of literal values, LVN ⊆ LS ∪ LL.

Example 2.2. In this example we present a simple interpretation. The interpretation N for vocabulary V, which consists of all names on the nodes and edges of the graph, is given by:

RN = {a, b, c, α, β, γ, Ω}
PN = {α, β, γ}
EXTN = {α → {⟨a, b⟩}, β → {⟨a, Ω⟩}, γ → {⟨b, c⟩}}
SN = {:JS → a, foaf:Person → b, foaf:Agent → c}
LN = ∅
LVN = {Ω}

The interpretation N evaluates all three triples of our considered graph as true:

⟨SN(:JS), SN(foaf:Person)⟩ = ⟨a, b⟩ ∈ EXTN(α) = EXTN(SN(rdf:type))

⟨SN(:JS), SN("John Smith")⟩ = ⟨a, Ω⟩ ∈ EXTN(β) = EXTN(SN(foaf:name))

⟨SN(foaf:Person), SN(foaf:Agent)⟩ = ⟨b, c⟩ ∈ EXTN(γ) = EXTN(SN(rdfs:subClassOf))

Therefore, the described graph as a whole is also evaluated as true. Hence, the simple interpretation N is a model of the graph.
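The valuation in Example 2.2 can be replayed mechanically. This sketch uses the example's own symbols; the dictionary encoding and the names "alpha"/"beta"/"gamma"/"Omega" are assumptions standing in for α, β, γ and Ω.

```python
# Sketch of Example 2.2: interpretation N makes a triple ⟨s, p, o⟩ true
# exactly when ⟨denotation(s), denotation(o)⟩ ∈ EXTN(SN(p)).
SN = {":JS": "a", "foaf:Person": "b", "foaf:Agent": "c",
      "rdf:type": "alpha", "foaf:name": "beta", "rdfs:subClassOf": "gamma"}
LV = {'"John Smith"': "Omega"}          # literal values (Ω in the example)
EXTN = {"alpha": {("a", "b")}, "beta": {("a", "Omega")}, "gamma": {("b", "c")}}

def denote(term):
    return SN.get(term) or LV[term]

def satisfies(s, p, o):
    """True iff ⟨denote(s), denote(o)⟩ ∈ EXTN(SN(p))."""
    return (denote(s), denote(o)) in EXTN[SN[p]]

graph = [(":JS", "rdf:type", "foaf:Person"),
         (":JS", "foaf:name", '"John Smith"'),
         ("foaf:Person", "rdfs:subClassOf", "foaf:Agent")]

print(all(satisfies(*t) for t in graph))   # N is a model of the graph
```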

2.2. RDF Graph and RDF Graph Store

A collection of RDF statements intrinsically represents a labeled directed multigraph. The nodes are the subjects and objects of its triples. To indicate that an RDF graph corresponds to a set of triples T, we denote it by GT.

Definition 2.3 (RDF graph). An RDF graph is a tuple GT = ⟨SO, ARC, f, t, I, larc⟩, where:

1. SO = S ∪ O is a set of vertices,

2. f : ARC → SO is a function which yields the source of each arc,
3. t : ARC → SO is a function which yields the target of each arc,
4. I is a set of IRIs,

5. larc∶ARC → I is a function mapping arcs to IRIs.


We denote by a(s,p,o) an arc x ∈ ARC such that f(x) = s, t(x) = o and larc(x) = p.

Example 2.3. The example in Figure 2.1 presents an RDF graph of a FOAF [27] profile.

This graph includes four RDF triples:

<#js> rdf:type foaf:Person .

<#js> foaf:name "John Smith" .

<#js> foaf:workplaceHomepage <http://univ.com/> .

<http://univ.com/> rdfs:label "University" .

Figure 2.1: An RDF graph
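Definition 2.3 can be sketched directly, encoding the graph of Figure 2.1 through the functions f, t and larc. The arc identifiers x1..x4 are illustrative assumptions; the triples are those of Example 2.3.

```python
# Sketch of Definition 2.3: arcs are first-class objects; f yields the source
# of each arc, t its target, and larc the IRI labelling it.
ARC = ["x1", "x2", "x3", "x4"]
f = {"x1": "<#js>", "x2": "<#js>", "x3": "<#js>",
     "x4": "<http://univ.com/>"}
t = {"x1": "foaf:Person", "x2": '"John Smith"',
     "x3": "<http://univ.com/>", "x4": '"University"'}
larc = {"x1": "rdf:type", "x2": "foaf:name",
        "x3": "foaf:workplaceHomepage", "x4": "rdfs:label"}

def triples():
    """Recover the RDF triples ⟨f(x), larc(x), t(x)⟩ from the arc structure."""
    return [(f[x], larc[x], t[x]) for x in ARC]

print(triples())
```

Note that because arcs, not triples, are primary, the structure naturally accommodates the recurring edges mentioned in Section 1.2.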

The RDF syntax and semantics may be extended to named graphs [31]. The named graph data model is a simple variation of the RDF data model. The basic idea of the model consists in introducing a graph naming mechanism, denoted NG.

Definition 2.4 (Named graph). A named graph N G is a pair ⟨u, GT⟩, where:

1. u ∈ I is a graph name, 2. GT is an RDF graph.

Example 2.4. The example in Figure 2.2 presents a named graph of a FOAF profile.

This graph has the name http://example.com/people and includes three RDF triples:

<http://example.com/people> {

<#js> rdf:type foaf:Person .

<#js> foaf:name "John Smith" .

<#js> foaf:workplaceHomepage <http://univ.com/> . }


Figure 2.2: A Named Graph

We assume that NG = {⟨u1, G1T⟩, ⟨u2, G2T⟩, . . . , ⟨un, GnT⟩} is a set of named graphs, where all IRI references are disjoint.

Definition 2.5 (Graph store). An RDF graph store is GS = {GT, N G}, where GT is called a default graph and N G is a set of named graphs.

The RDF graph store should have one default graph and zero or more named graphs.

All graphs in the graph store should be mutable in the sense of being time-dependent.
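A minimal sketch of such a graph store (illustrative Python; the class and method names are ours, not an API from the thesis), with one mutable default graph and a mapping from graph names to named graphs:

```python
# Illustrative sketch of an RDF graph store per Definition 2.5:
# one mutable default graph plus named graphs keyed by their IRI.
class GraphStore:
    def __init__(self):
        self.default = set()   # the default graph G_T, a set of triples
        self.named = {}        # u -> set of triples, the named graphs NG

    def add(self, triple, graph=None):
        """Insert a triple into the default graph, or into named graph `graph`."""
        target = self.default if graph is None else self.named.setdefault(graph, set())
        target.add(triple)

store = GraphStore()
store.add(("#js", "rdf:type", "foaf:Person"))
store.add(("#js", "foaf:name", '"John Smith"'),
          graph="http://example.com/people")
```

Both graphs remain mutable: further `add` calls change their contents over time, matching the time-dependence requirement above.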

An entailment regime is an extension of the semantics, which specifies one of a range of standard entailment relations in the Semantic Web.

Definition 2.6 (Interpretation of an RDF graph store). Let KE be the set of all entailment regime interpretations, which are defined in [51]. The interpretation of an RDF graph store GS over vocabulary V is a pair ⟨N, e⟩, where:

1. N is an interpretation of GT (called the default graph),
2. e is a function mapping V to KE.

Data storage in RDF graph stores is possible thanks to serialization, that is, the process of transforming the data model from a graph form into a serial form: a sequence of characters conforming to a defined syntax. Such a sequence may be sent over the web and then deserialized, that is, reconstructed to the state from before serialization.
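The round trip can be sketched as follows (illustrative Python; a deliberately simplified N-Triples-like syntax covering only IRIs and plain literals, not a full RDF serializer):

```python
# Illustrative serialization round trip. This toy syntax handles only
# absolute IRIs (written <...>) and quoted literals, for demonstration.
def serialize(triples):
    lines = []
    for s, p, o in sorted(triples):
        obj = o if o.startswith('"') else "<%s>" % o
        lines.append("<%s> <%s> %s ." % (s, p, obj))
    return "\n".join(lines)

def deserialize(text):
    triples = set()
    for line in text.splitlines():
        s, p, o = line.rstrip(" .").split(" ", 2)
        strip = lambda t: t[1:-1] if t.startswith("<") else t
        triples.add((strip(s), strip(p), strip(o)))
    return triples

g = {("http://univ.com/", "http://www.w3.org/2000/01/rdf-schema#label",
      '"University"')}
assert deserialize(serialize(g)) == g   # deserialization restores the graph
```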


2.3. Document-Oriented Databases and Query Language

A document-oriented database1 is a database in which every record is stored as a semistructural document.

Definition 2.7 (Document-oriented database). A document-oriented database is a pair D = ⟨D, Sd⟩, where:

1. D is a nonempty set called the domain of the database,
2. Sd is a set of semistructural documents.

In the Semantic Web most serializations have a semistructured character. Semistructural documents, which are containers of semistructural data, have the following features: multiple nestings, complex structures, links from different sources and the possibility of representation in the form of a directed graph. Any number of fields of any length may be added to such a document. The described databases use a data model in which there is no division between data and schema. In such databases it is not necessary to use empty fields in a record, which enables adding information at any time without any loss of space or changes to a schema. Document-oriented databases contain tags that divide semantic elements, record hierarchies and fields in the data. The architecture of a document-oriented database is presented in Figure 2.3.

Example 2.5. In this example we present a semistructural document.

firstName="John", sureName="Smith", occupation="teacher", age="25",

interest="sailing".
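Such a record can be sketched as a plain mapping (illustrative Python); because there is no schema, a new field can be attached at any time:

```python
# Illustrative schema-free record: the document from Example 2.5 as a
# mapping. The field names follow the example above verbatim.
doc = {"firstName": "John", "sureName": "Smith",
       "occupation": "teacher", "age": "25", "interest": "sailing"}

# A field added later requires no schema migration and no empty
# placeholder fields in other records.
doc["homepage"] = "http://univ.com/"
```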

1 Sometimes also called a document database.


Figure 2.3: Document-oriented database architecture

Document-oriented databases should provide a query language, which is a special-purpose programming language designed for managing data.

Definition 2.8 (Query). Let Q = ⟨HT, GT, C⟩ be a query, where ⟨HT, GT⟩ is a tableau [4] (HT and GT are RDF graphs with variables from V), C is a set of constraints and D a document-oriented database. A valuation v is a function v ∶ V → S and C ⊆ V. We denote by v(G) the graph obtained after replacing every occurrence of a variable x in GT by v(x). A matching of the graph GT in database D is a valuation v such that v(G) ⊆ D. The matchings of interest are those that satisfy the constraints C. Note that the set of variables V is disjoint from I, B and L.

Example 2.6. In this example we present a query, which matches a person’s surname.

?surname <-

(?person, type, Person), (?person, name, ?surname)
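A naive matcher for this notion can be sketched as follows (illustrative Python; the constraint set C is omitted, and variables are strings beginning with '?'). It enumerates valuations v such that v(G) ⊆ D:

```python
# Illustrative matcher for Definition 2.8: enumerate valuations of the
# pattern's variables against a set of stored triples.
def match(pattern, data, v=None):
    v = v or {}
    if not pattern:
        yield dict(v)          # every pattern triple has been satisfied
        return
    (s, p, o), rest = pattern[0], pattern[1:]
    for triple in data:
        binding = dict(v)      # copy so failed candidates leave v intact
        ok = all(binding.setdefault(t, d) == d if t.startswith("?") else t == d
                 for t, d in zip((s, p, o), triple))
        if ok:
            yield from match(rest, data, binding)

data = {("#js", "type", "Person"), ("#js", "name", '"Smith"')}
query = [("?person", "type", "Person"), ("?person", "name", "?surname")]
results = list(match(query, data))
# results == [{"?person": "#js", "?surname": '"Smith"'}]
```

The sketch corresponds to Example 2.6: the single valuation binds `?person` to `#js` and `?surname` to `"Smith"`.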

2.4. Semantic Web Environment

The Semantic Web and Linked Data environment is a set of independent agents possessing local resources of decentralised and scattered knowledge. In these concepts, the agent is an entity capable of communicating, monitoring its environment and undertaking autonomous decisions, in order to perform specific tasks ordered by a user. Individual entities are associated and have no central point. The role of agents in such an environment is to use the methods of knowledge representation for the purpose of effective semantic exchange of information among each other, users and resources in the web.

Definition 2.9 (Agent). An agent is a tuple A = ⟨u, GS, O⟩, where:

1. u is a unique IRI, u ∈ I,
2. GS is an RDF graph store,
3. O is a set of operations, which can be executed in the graph store.

Therefore, the graph store is a kind of knowledge base for the agent.

The closed world assumption (CWA) applies to classical databases. It means that only positive facts are recorded in a database. If a fact cannot be proven, the corresponding sentence is assumed to be false. The Semantic Web is different, since the open world assumption (OWA) applies here2. It means that in an incomplete situation (if a sentence cannot be proven), an agent cannot conclude that such a sentence is false and should highlight the fact that it is not possible to determine whether it is true or false.
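The difference between the two assumptions can be sketched as follows (illustrative Python; `None` stands for "unknown"):

```python
# Illustrative contrast: under the closed world assumption an unproven
# fact is false; under the open world assumption it is unknown.
facts = {("#js", "rdf:type", "foaf:Person")}

def holds_cwa(triple):
    return triple in facts                       # unproven => False

def holds_owa(triple):
    return True if triple in facts else None     # unproven => unknown

q = ("#js", "foaf:age", '"25"')
# holds_cwa(q) is False, while holds_owa(q) is None (unknown)
```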

2.5. Authentication, Authorization and Trust

In the Semantic Web environment (including graph stores) the security of information should be provided. Therefore, access to information depends on authentication and authorization. The authentication and authorization processes are inherently linked, where authentication should be the first step and authorization the second. Authentication is a process which verifies an identity declared by an agent that has access to the system.

Definition 2.10 (Authentication realm). An authentication realm is a tuple AN = ⟨A, S, K, AK, a, k⟩, where:

1. A is a set of agents,
2. S is a session,
3. K is a key, which performs various cryptographic operations,
4. AK is a set of ordered pairs of agents and keys, AK ⊆ A × K,
5. a ∶ S → A is a function used to assign a single agent to each session, a(si),
6. k ∶ S → 2K is a function used to associate a session with its keys, ⟨a(si), k⟩ ∈ AK.

2 The SPARQL query language contains features which have a CWA character, e.g. checking the absence of triples is a form of negation, called negation as failure.
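The mappings of Definition 2.10 can be sketched with plain dictionaries (illustrative Python; the agent IRI and key names are made up):

```python
# Illustrative sketch of an authentication realm: each session is assigned
# one agent via a(s) and a set of keys via k(s), giving AK ⊆ A × K.
a = {"s1": "https://example.com/agent#alice"}     # a : S -> A
k = {"s1": {"key-rsa-1", "key-rsa-2"}}            # k : S -> 2^K

# AK is the induced set of agent-key pairs {(a(s), key)}.
AK = {(a[s], key) for s in a for key in k[s]}
```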

Authorization, on the other hand, confirms or fails to confirm the agent’s rights to use the system.

Definition 2.11 (Authorization realm). An authorization realm is a tuple AZ = ⟨A, S, R, P, PR, AR, a, r⟩, where:

1. A is a set of agents,
2. S is a session,
3. R is a role,
4. P is a permission,
5. PR is a set of ordered pairs of permissions and roles, PR ⊆ P × R,
6. AR is a set of ordered pairs of agents and roles, AR ⊆ A × R,
7. a ∶ S → A is a function used to assign a single agent to each session, a(si),
8. r ∶ S → 2R is a function used to associate a session with its roles, ⟨a(si), r⟩ ∈ AR and ⟨p, r⟩ ∈ PR.
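A permission check following Definition 2.11 can be sketched as (illustrative Python; the roles and permissions are made up):

```python
# Illustrative sketch of an authorization realm: PR ⊆ P × R relates
# permissions to roles, and r(s) yields the roles of a session.
PR = {("read", "viewer"), ("read", "editor"), ("write", "editor")}
r = {"s1": {"viewer"}}                 # r : S -> 2^R

def permitted(session, permission):
    """True iff some role of the session carries the permission."""
    return any((permission, role) in PR for role in r.get(session, ()))
```

For instance, session `s1` holds only the `viewer` role, so it is permitted to read but not to write.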

A service provider (SP) provides a service that the agent must sign into to use. When the SP depends on a third party for identity authentication, it is often called a relying party (RP), because it relies on the external authentication service. An identity provider (IdP) is another service whose job is to verify or provide information that aids in verifying the identity of the agent. Both authentication and authorization can be based on an OWL ontology and depend on trust.

The concept of trust is relevant in the information area, including security issues. Trust means a belief that an entity has the competence to perform reliable and safe activities in a specific context. Trust can be defined by a trust metric, which is a measurement of the degree to which one agent trusts another agent. It can be specified as an ordered set, where values close to the smallest element represent a higher degree of mistrust, and values close to the largest element represent a higher degree of trust. Pure RDF triples can be seen as statements in which the trust value equals the largest element.


Example 2.7. In this example we present two RDF triples with a trust metric, which describes a person’s surname as a trustworthy one.

_:me rdf:type foaf:Person 1 . _:me foaf:surname "Smith" 1 .
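One way to sketch such annotated statements (illustrative Python; the [0, 1] range is an assumption here, since the definition above only requires an ordered set with smallest and largest elements):

```python
# Illustrative trust-annotated statements: each triple carries a value
# from [0, 1], where 1 is full trust; pure RDF triples default to 1.
annotated = {
    ("_:me", "rdf:type", "foaf:Person"): 1.0,
    ("_:me", "foaf:surname", '"Smith"'): 1.0,
    ("_:me", "foaf:age", '"25"'): 0.4,     # a less trusted claim
}

def trusted(threshold):
    """Return the triples whose trust metric meets the threshold."""
    return {t for t, v in annotated.items() if v >= threshold}
```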


3. Related Work

We now give an overview of the state of the art in the domain of the Semantic Web and Linked Data. There are many works related to the representation, storage and accessing of data, which we discuss below. In the following chapters we focus on trust metrics and trust annotation, so we discuss such metrics below. Then we focus on RDF graph stores, because in the next chapters we propose a document-oriented graph store. Next, RDF serializations and query languages are discussed, because in the following chapters we propose a solution in this area. Finally, authentication and authorization methods are described in a web context, which is related to our new proposal discussed in the next chapters.

3.1. Trust Metrics

Classical systems of trust metrics can be divided into two categories: discrete metrics [7, 2] and continuous metrics [118, 56, 81]. The proposal [7] is based on the binary values: trusted or untrusted. Other authors in [2] suggest the use of six values: distrust, ignorance, minimal, average, good, or complete. Many authors propose that the metric should have intervals, such as [−100; 0) ∪ (0; 100] in [118], [0; 1] in [56] or [−1; 1] in [81]. There are also metrics based on subjective logic [93] and fuzzy logic [40].

In the Semantic Web, works related to trust metrics may be divided into two categories: those using Webs of Trust [54, 92] and those using ontologies [53, 55, 62].

A decentralised approach to trust is defined by security metrics with the range [0; 1] in [54], and [1; 10] or [0.5; 4] in [92]. Solutions [54] and [92] assume information on a graph to be published by third parties. Their weakness is the fact that it is not possible to ensure evaluation in cases of a higher load.

The papers [53, 55] constitute another approach and they use ontologies to ensure trust metrics. In [53] security metrics may only be equal to {0, 1}, which is not sufficient to represent a whole spectrum of trust conditions. However, the authors of [55] extended the Friend of a Friend (FOAF) ontology [27] with the trust values {1, 2, ..., 9}, where 1 means no trust and 9 means absolute trust. Unfortunately, such metrics do not ensure a precise reflection of trust because the set is finite. Another proposal is the Hoonoh ontology [62], which in contrast to [53, 55] allows the application of a continuous scale. However, this solution does not define the maximum and minimum values, so it is impossible to determine when there is a state of absolute trust or a complete lack of trust. The weakness of the proposals [54, 92, 53, 55, 62] is the fact that the way in which metrics are stored in RDF triples has not been defined.

On the other hand, there are also proposals for annotation algebras and deductive systems. One paper [82] presents a simple abstract fragment of RDF, which is easy to formalize. Unfortunately, it does not support trust metrics. Yet another approach is shown in [99], which presents a generic framework for representing and reasoning with annotated Semantic Web data. Unfortunately, the authors do not demonstrate inference rules for RDF with trust. What is important is that the approaches presented in [82, 99] do not support OWL properties. In [114], the authors discuss how to annotate the predicate, rather than the triple; however, this approach requires specific algorithms.

3.2. RDF Stores

RDF stores support three methods of data storage: a relational model [32, 76, 86, 29], a graph model [39, 1] and a mixed model [45, 83].

As compared to the relational data model, the RDF model is more expressive. Contrary to the RDF model, the relational model was designed for simple records with data types whose structure is known in advance. The relational database schema is fixed and not easily extended. The query languages of relational databases do not handle paths or property characteristics such as symmetry or functionality. Taking all these differences into account, Jena [32, 76], Sesame [29] and RDF API for PHP (RAP) [86] apply a solution based on duplicating predicates. The proposals [32, 76] introduce an RDF store for Java. It provides an application programming interface (API) to extract data from and write to RDF graphs. The graphs are represented as an abstract model. A model can be sourced with data from files, databases, URLs or a combination of these. Yet another store for querying and analyzing RDF data is proposed in [29]. Sesame can be sourced with data from memory, from a native store that reads and writes its data directly to/from the disk, or from a relational database. Another solution is presented in [86]. It includes in-memory and database model storage. The memory model implementation stores statements in an array in the system memory, and the database model implementation stores statements in a relational database.

The graph model is closer to the RDF model. However, it does not provide the possibility of dividing graphs into collections that would allow a description of their provenance. Moreover, the native query languages of graph databases are difficult to adapt for RDF data. An example of using this model is the Neo4J RDF store proposed in [39]. Similar to proposals [32, 76, 86], proposal [39] allows for the storage of RDF triples in memory. It is a very quick solution, although in the context of an RDF graph store it may only play the role of cache memory, since the data stored are volatile. Yet another RDF store using a graph model is presented in [1]. In contrast to a relational database, AllegroGraph considers each stored item to have any number of relationships. It is designed to store RDF tuples.

In [45] Virtuoso, a row-wise transaction-oriented database, is presented. It has been re-targeted as an RDF store with inference support, and revised towards column-wise compressed storage and vectored execution. Yet another proposal is presented in [83]. Mulgara is not based on a relational database, due to the large numbers of table joins encountered by relational systems when dealing with metadata. Instead, it is a database optimized for metadata management.

3.3. Serializations

Being an abstract model, RDF has a few serialization formats. They may be divided into two groups: ones that handle only the RDF model [14, 89, 88] and ones that handle both RDF and named graph models [33, 16, 20, 37].

To process a graph in RDF/XML serialization [14], nodes and predicates must be represented in terms of the Extensible Markup Language (XML): names of elements, names of attributes, contents of elements and values of attributes. This syntax uses qualified names (QNames) to represent Uniform Resource Identifier (URI) references. There are certain restrictions imposed on this serialization by the XML format and the use of XML namespaces, which prevent the storage of some RDF graphs, as certain URI references are prohibited by the specifications of these standards. In [33] J. J. Carroll et al. also point to other unsolved syntactic issues involving collections, reification and quoting. Moreover, RDF/XML may not be fully described by schemes such as DTD or XML Schema. Another disadvantage of the XML syntax is its inability to encode all legal RDF graphs. It does not handle named graphs, while the Triples In XML (TriX) serialization [33] does. TriX also uses XML syntax but is not compatible with [14]. It is aimed at ensuring strong normalisation and a compliant representation of RDF graphs, enabling effective use of basic XML tools such as XSLT, XQuery etc. TriX syntax may also use qualified names (QNames) to shorten URI references.

Notation3 (N3) [16] is the next syntax for RDF models, designed with human readability in mind. N3 is much more compact and readable than RDF/XML syntax, and it is characterised by naturalness and symmetry. This proposal has some features, such as support for RDF-based rules, which go beyond the serialization of RDF models. In N3, URI abbreviations use prefixes, which are bound to a namespace. After repetition of another object for the same subject and predicate a comma occurs, and after repetition of another predicate for the same subject a semicolon occurs. Blank node syntax with certain properties inserts the properties between square brackets. Formulae, allowing N3 graphs to be quoted within N3 graphs, use curly brackets. The proposal [16] supports named graphs.

Another proposal refers to the Terse RDF Triple Language (Turtle) [89], which is a simplification and subset of [16]. This solution offers a textual syntax that makes it possible to record RDF graphs in a compact form, including abbreviations that use data patterns and types. The drawbacks of this proposal include the fact that it is not capable of handling named graphs and that it is possible to represent RDF triples in an unnormalised form. The following proposal refers to TriG [20], which is based on [89] extended by the grouping of triples into multiple named graphs. It has a textual syntax, which offers an alternative to [33]. The authors of this solution define a named graph operator as an equality sign or dot operator for conformance with [16].

N-Triples [88] is also a textual format of RDF serialization. It is a subset of Turtle. This format is very simple and normalised, and therefore it is easily generated by software. In N-Triples every line of the file represents a single RDF statement. Unfortunately, there are character restrictions imposed on older versions by the US-ASCII standard, and it does not handle named graphs. However, there is a format based on N-Triples – N-Quads [37] – that extends it to named graphs. Character restrictions are also a drawback of this serialization.

The main disadvantage of the above-mentioned serializations is the lack of an API for modifying graphs in such documents. Moreover, approaches [89, 16, 20, 37] are poorly suited to software processing, since they cannot be handled with the standard libraries of popular programming languages and require a non-standard parser. What is more, proposals [16, 20, 37] do not have a generally known registered media type, while serializations [14, 20] are difficult to process since they are lengthy data exchange formats. The relationships between the presented serializations are shown in Figure 3.1.

Figure 3.1: Relationship between RDF serializations (Notation3, Turtle, TriG, N-Triples, N-Quads, RDF/XML, TriX). Solid arrows mean closer relations and dashed arrows mean weak relations.

3.4. Matching Subgraph and Access to Store

Classical web services focus on the Remote Procedure Call (RPC) protocol. XML-RPC [72] is the most popular protocol based on the RPC style. It uses the XML language in the contents of requests and answers, which is a drawback, since the exchanged messages are very lengthy. JSON-RPC [100] is another RPC approach. Its requests and answers are encoded in JavaScript Object Notation (JSON). The remote method may be initiated by a message sent to the web service by means of the Hypertext Transfer Protocol (HTTP) or a socket. Unfortunately, services [72, 100] must map requests directly to a given function of a programming language.

Simple Object Access Protocol (SOAP) [36] supersedes XML-RPC. It is a structured data exchange protocol used to perform different online services. It is based on XML as a message format and usually uses HTTP for transmission. Unfortunately, SOAP is very complex and the XML language is not easy to use, so communication in this protocol is long and complicated. What is more, it does not take advantage of numerous possibilities existing in HTTP, such as authentication, caching and content-type negotiation. The protocol [36] is currently used in store access solutions.

In the Semantic Web, the proposals of query languages take a lot from traditional database languages, such as SQL or OQL. The first solution is RQL [66], which contains a set of basic queries and iterators and is similar to OQL. In RQL, access to data and schema can be combined in all manners. In this proposal data can be retrieved by their types, by navigating to the appropriate position in the RDF graph. Regrettably, it can be criticised for its large number of features and its choice of syntactic constructs, which led to the simplifications introduced in SeRQL [28]. This proposal is more accessible than [66] because of several types of shorthand, such as object-property or object lists. Another derivative of [66] is eRQL [117], a radical simplification based on a keyword-based interface. Other ideas are the SquishQL and RDQL [78] languages, which are similar to SQL and use patterns and terms of subgraph matching filters. TriQL [21] extends [78] with constructs supporting the querying of named graphs, which is introduced in [20]. In [21] a query is evaluated not against a single set of triples but rather against a set of such sets. The weakness of solutions [66, 28, 117, 78, 21] is that they offer only subgraph matching, without the possibility of store management and without defined store access protocols. SPARQL [61, 50] is the most developed query language. It combines the features of [78] while adding the possibility of boolean queries and graph construction. In its new version [50] SPARQL also handles graph management. However, the weakness of this solution is the use of [36] as an access protocol. Yet another SQL-like proposal is RSQL [74]. It provides a clause to define abbreviations for long namespaces and it uses the logical operators and, or and not as constraints. The relationships between the presented query languages are shown in Figure 3.2.

Figure 3.2: Relationship between RDF query languages (RQL, SeRQL, eRQL, SquishQL, RDQL, TriQL, SPARQL). Solid arrows mean closer relations and dashed arrows mean weak relations.

There are also many proposals which are not OQL- or SQL-like. One can distinguish three main categories of query language: reactive rule [65, 83], pattern-based [98, 95], and navigational access [75]. Algae [65] is based on N-Triples. It extends the [89] serialization with actions and constraints. Another reactive rule query language is iTQL [83]. It provides querying, update and transaction management functionality. TRIPLE [98] is a query, inference and transformation language. The authors of [98] introduce a syntax close to F-Logic. Yet another pattern-based query language is Xcerpt [95]. RPath [75] is one more RDF query language, which uses XPath for data access. Unfortunately, proposals [95, 75] support only RDF in its XML serialization.

3.5. Authentication and Authorization Methods

In classical databases one can distinguish three main categories of security: authentication of query results [84, 73], access control approaches [19, 8] and dataset encryption [59]. The authentication of query results is divided into two subcategories: cryptographic primitives based on a digital signature [84] and on Merkle Hash Trees [73]. The access control approach consists of two subcategories: content-based access control [19] and rule-based access control [8].

For semi-structured databases, which are more suited to web applications, the main research concentrates on access control [41, 18, 64]. In [41] access restrictions on the structure and content are defined. Others [18] present varying protection granularity levels. The access control policy is proposed by using a 5-tuple of subject, object, privilege, propagation option and signed access decision. In [64] an access control model provides provisional authorization.

On the other hand, in web applications there are database independent and resource-oriented solutions, which focus on the decentralized manner of the authentication and/or authorization procedures. Good examples are OpenID [90] and OAuth Protocol [60]. The idea of OpenID consists of assigning a unique ID to a resource, which takes the form of a unique URL, and is managed by an OpenID provider, which handles the authentication procedure. While OpenID is all about using a single identity to sign into many sites, OAuth is about giving access to a resource without sharing identity at all. The OAuth protocol provides authentication, authorization and data integrity.

Another proposal is BrowserID [58]. It uses an email address as a unique identifier rather than the URIs used in [90]. Unfortunately, it was developed for web browsers and is difficult to use with RDF graph stores. Yet another proposal in this area is provided in [30] by means of the Security Assertion Markup Language (SAML), a standard for exchanging authentication and authorization data between identity providers and service providers. Another proposal, extended with access control support, is XACML [79]. It is an XML-based language for security policies and access decisions. XACML provides security administrators with the functionality to describe an access control policy once, without having to rewrite it numerous times in different application-specific languages. While this approach may be sufficient for graph stores using the XML syntax for storing data, it is not satisfactory for other RDF graph stores.

In the context of the Semantic Web there are also new proposals for authorization solutions. The papers [91, 3] define policy-based access control. [91] defines policies in which the subgraphs affected by various operations, such as insert, remove, update and read, are identified by specifying RDF patterns. The authors define a set of policy rules enforced by a policy engine to make the authorization decisions. Unfortunately, they do not discuss the semantics in a formal way. In [3] triples are not annotated with accessibility information, while the enforcement mechanism is query-based. The policy permissions are injected into the query in order to ensure that the triples obtained are only the accessible ones. Unfortunately, the semantics of the policy are not formally defined in [3]. Yet another approach is provided in [68, 63], which respects RDFS entailments.

In [68] the authors discuss how conflicts can be resolved using the RDFS subsumption hierarchies. This proposal, as in [91], requires instantiating RDF patterns. In [63] inferences are computed for an RDF dataset without revealing information that might have been explicitly not permitted.


4. Data Structures for RDF

Having presented the basics and the state of the art of the Semantic Web and RDF, we now focus on new data structures for RDF. Section 4.1 presents trust metrics and the way they are represented in RDF triples. The subjects discussed in this section have been published in [112]. Section 4.2 discusses collections with provenance metrics and a document-oriented graph store with document serialization. The subjects discussed in that section have been published in [103, 102, 105, 101].

4.1. Representing Trust Metrics with Annotations

One of the main problems of the Semantic Web environment is defining the trust level of different RDF triples. Taking into account that the Semantic Web environment is enormous and that everybody may contribute to it and access it, the question arises of how much users can trust information from its providers. Trust metrics and the annotation of RDF statements are solutions to this problem.

In this section we propose annotated RDF, trust metrics and inference principles. We also present ways to extend the query language to trust metrics and introduce mapping algorithms to the pure RDF model.

Trust metrics are required both for single RDF statements and for annotations of whole collections of RDF statements. In the first instance, a metric is a measure of the agent's degree of trust in an RDF triple. Thanks to such annotations, agents may make inferences that calculate new trust metric values. In the second instance, when calculating new metrics, they annotate whole groups of RDF statements. This instance is described in Section 4.2. The values of trust metrics can be the result of social interaction between agents.


4.1.1. Preliminaries

In this subsection we introduce RDF interpretation and semantic conditions, which should be satisfied, as the basis for extended interpretation. RDF consists of resources, properties, literals and classes. All the relationships between these parts are presented in Figure 4.1.

Figure 4.1: Hierarchy of RDF semantics: Things and Resource (individuals), Literal (data values), Property with ObjectProperty and DatatypeProperty (properties), and Class (classes)

Let us denote by ρ a subset of the vocabularies of RDF [87], RDFS [26] and OWL [96] as follows: ρ = {rn, dm, tp, spo, sco, sa, df, io, ep, ec, pdw, dw}, where rn is range, dm is domain, tp is type, spo is subPropertyOf, sco is subClassOf, sa is sameAs, df is differentFrom, io is inverseOf, ep is equivalentProperty, ec is equivalentClass, pdw is propertyDisjointWith and dw is disjointWith.
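For reference, the abbreviations in ρ can be expanded to the usual prefixed RDF, RDFS and OWL terms (an illustrative mapping, assuming the standard namespaces; ep is read as equivalentProperty):

```python
# Illustrative expansion of the ρ abbreviations to their usual prefixed
# names in the RDF, RDFS and OWL vocabularies.
rho = {
    "rn": "rdfs:range", "dm": "rdfs:domain", "tp": "rdf:type",
    "spo": "rdfs:subPropertyOf", "sco": "rdfs:subClassOf",
    "sa": "owl:sameAs", "df": "owl:differentFrom", "io": "owl:inverseOf",
    "ep": "owl:equivalentProperty", "ec": "owl:equivalentClass",
    "pdw": "owl:propertyDisjointWith", "dw": "owl:disjointWith",
}
```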

Following [26, 96, 87] we introduce the concept of class. The set of classes CN is defined as CN = {x ∈ RN ∶ ⟨x, SN(C)⟩ ∈ EXTN(SN(tp))}, where RN is a nonempty set of named resources (see Definition 2.2). Additionally, the mapping CEXTN ∶CN → 2RN is defined as CEXTN(c) = {x ∈ RN ∶ ⟨x, c⟩ ∈ EXTN(SN(tp))}, where EXTN is an extension function used to associate properties with their property extension (see Definition 2.2).

Note that SN(x) ∈ CN and x ∈ {p, r, l, c}, where p is Property IRI, r is Resource IRI, l is Literal IRI and c is Class IRI. Let RN = CEXTN(r), PN = CEXTN(p), CN = CEXTN(c) and LVN = CEXTN(l), then a simple interpretation satisfies the following semantic conditions:

1. for properties:
