Document-oriented Graph Store - Data Structures for RDF

4. Data Structures for RDF

4.2. Document-oriented Graph Store

One of the most important elements of the discussed environment is RDF data representation and storage. Taking these interrelations into account, one should think of an effective RDF graph store, the structure of its documents and the possibility to group RDF statements.

This section presents the document-oriented graph store, document collections and their serialization. We also propose algorithms for the purpose of document generation and their normalization, as well as the ways in which our solution may be mapped into the Named Graph model.

This section applies the trust metrics presented in Section 4.1 to the collections in the document-oriented graph store. The trust value of a collection is calculated from the dedicated RDF quads of other agents that are associated with the given collection.

If a collection contains RDF triples whose trust metrics cannot be determined by the inference rules, the fourth value (the trust metric) is propagated from the collection that contains these RDF statements.

4.2.1. Data Collections in Document-oriented Graph Store

In this subsection we propose a document-oriented graph store that is not bound to any predefined database type. Instead, it stays close to the RDF data, so no predefined structure is needed, which makes it flexible. Moreover, this approach is consistent with the idea of Linked Data. The graph store can be thought of as a store of containers, called data collections or simply collections. A data collection is similar to a relation in relational databases and to a class in object databases. A collection is represented by a graph, a provenance and a trust metric. Collections contain documents, and documents store serialized RDF statements. The document is the central element of the graph store. For the sake of generality, we define a document here as an ordered set of keys with associated values, each of which can be one of several different datatypes. Figure 4.5 illustrates the proposed data model.

Hence, a collection can be seen as a group of RDF triples and quads (representing documents).

Definition 4.6 (RDF graph over RDF quads and triples). An RDF graph over RDF quads and triples is an extension of an RDF graph (see Definition 4.2) over Q (see Definition 4.1) and T (see Definition 2.1), and it is a tuple G_QT = ⟨SO^QT, ARC^QT, f^QT, t^QT, I^QT, l_arc^QT, M^QT, l_arcM^QT⟩, where:

SO^QT = S ∪ O;
ARC^QT = Q;
f^QT = Proj_4∣Q¹ ∪ Proj_3∣T¹;
t^QT = Proj_4∣Q³ ∪ Proj_3∣T³;
I^QT = I;
l_arc^QT = Proj_4∣Q² ∪ Proj_3∣T²;
M^QT = M;
l_arcM^QT = Proj_4∣Q⁴.

Definition 4.7 (Collection). A collection is a tuple C = ⟨r, v, G_QT⟩, where:

1. r ∈ I is the provenance of a graph, which can be interpreted as an IRI,
2. v ∈ M is a trust metric,
3. G_QT is an RDF graph over Q and T.

Provenance provides information about a graph's origin, such as who created it, when it was modified, or how it was created. It is used for building representations of the entities involved in producing a piece of data. Trust metrics provide information about the trustworthiness of an RDF graph.

Example 4.10. In this example we present a collection tuple.

⟨ http://example.org/users, 0.9, ( _:a foaf:name "John Smith" 1) ( _:a foaf:age 29 0.3) ⟩
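The collection tuple of Definition 4.7 can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name `Collection`, the method `quads`, the blank node spelling `_:a` and the extra `foaf:nick` triple are hypothetical and not part of the proposal. The sketch also shows the propagation rule from the introduction of this section, where a triple without its own trust metric takes the trust value of its collection.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# A statement is (subject, predicate, object, trust).
# trust is None for a plain triple that carries no trust metric of its own.
Statement = Tuple[str, str, object, Optional[float]]


@dataclass
class Collection:
    provenance: str  # r: IRI describing the origin of the graph
    trust: float     # v: trust metric of the whole collection
    statements: List[Statement] = field(default_factory=list)  # G_QT

    def quads(self) -> List[Tuple[str, str, object, float]]:
        """Return all statements as quads; a missing trust value is
        replaced by the collection-level trust."""
        return [
            (s, p, o, self.trust if t is None else t)
            for (s, p, o, t) in self.statements
        ]


users = Collection(
    provenance="http://example.org/users",
    trust=0.9,
    statements=[
        ("_:a", "foaf:name", "John Smith", 1.0),
        ("_:a", "foaf:age", 29, 0.3),
        ("_:a", "foaf:nick", "js", None),  # hypothetical triple without trust
    ],
)

for quad in users.quads():
    print(quad)
```

The first two statements keep their own trust values, while the third one is reported with the collection trust 0.9.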

For the needs of the graph store we extend the concept from Subsection 4.1.1, where we mainly define the collection class. We propose S^N(x) ∈ C^N, where x is a collection IRI (see Subsection 4.1.1). Let N be an RDF simple interpretation such that the following conditions hold:

1. {C_i ∶ i = 1, 2, . . . , n} ∩ LV^N = ∅,
2. ⟨S^N(u_i), S^N(x)⟩ ∈ EXT^N(tp) ⇒ S^N(u_i) = C_i.

Definition 4.8 (Document-oriented graph store). A document-oriented graph store is GS_D = {C₁, C₂, . . . , C_i}, where every C_i is a collection, i ≥ 1.

Definition 4.9 (Interpretation of a document-oriented graph store). The vocabulary for a document-oriented graph store GS_D is defined as V_G = {u_i ∶ i = 1, 2, . . . , n} ∪ V, where V is the RDF vocabulary and u is an IRI reference. The interpretation of a document-oriented graph store over vocabulary V_G is a pair ⟨E, h⟩, where:

1. E is an interpretation of M-annotation (see Subsection 4.1.2),

2. h is a function mapping V_G to K_E, where K_E is the set of all entailment regime interpretations [51].

Figure 4.5: Document-oriented graph store model

4.2.2. Document Serialization

In this subsection we introduce the concept of an RDF in JSON Document (denoted RDFJD in the following sections) and its serialization. Serialization is the process of converting a data structure into a format that can be stored, transmitted across the web and reconstructed later in the same or another computer environment. We define a document as a resource that serves as a container of semi-structured data. One semi-structured data format is JavaScript Object Notation (JSON) [34], a syntax designed for human-readable data interchange that is also easy for machines to generate. It uses both simple datatypes, such as number, string or boolean, and composite datatypes, such as array and object.

We propose a serialization based on JSON that is equivalent to the RDF model while being more legible and brief. The proposal is a lightweight textual syntax that can easily be modified by humans, servers and clients. It can also easily be transformed from other syntaxes. Another benefit of serializing RDF graphs in JSON is that many software libraries and built-in functions support the serialization.

The difference between regular JSON and RDFJD is that an RDFJD object uniquely identifies itself on the World Wide Web and can be used without introducing ambiguity across a Web Service that uses a document-oriented graph store.

Definition 4.10 (Structure). A proposed structure is a tuple S = ⟨E, id, R⟩, where:

1. E is a set of elementary data, E = {e_i ∶ i ∈ Z},
2. id is an identifier of the structure,
3. R is a union of input relations {id} × E = {⟨id, e_i⟩ ∶ i ∈ Z} and output relations E × E ⊃ {⟨eⁱ, e_i⟩ ∶ i ∈ Z}.

The proposed structure can be modeled as an abstract data structure with two operations:

1. U = get(D, Y) – returns the data U stored in the structure D under the key Y.
2. set(D, Y, U) – stores the value U under the key Y in the structure D.
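The two operations can be illustrated with a minimal Python sketch. It assumes that a built-in `dict` is an acceptable stand-in for the structure D, since a `dict` keeps its keys in insertion order (Python 3.7 and later), which matches the ordered set of keys used to define a document. The function names `get_value` and `set_value` are hypothetical.

```python
from typing import Any, Dict


def set_value(d: Dict[str, Any], key: str, value: Any) -> None:
    """set(D, Y, U): store value U under key Y in structure D."""
    d[key] = value


def get_value(d: Dict[str, Any], key: str) -> Any:
    """get(D, Y): return the value stored in structure D under key Y."""
    return d[key]


# Build a small document one key at a time.
doc: Dict[str, Any] = {}
set_value(doc, "_prov", "http://example.org/g1")
set_value(doc, "_trust", 0.9)

print(get_value(doc, "_trust"))
print(list(doc))  # keys are returned in insertion order
```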

We propose two types of RDFJD documents:

1. directive document, which expresses the context of statement documents,
2. statement document, which expresses RDF triples with trust.

Definition 4.11 (Directive document). A directive document is associated with a collection and implements the proposed trust metric, provenance, and defines the short-hand names that are used throughout an RDFJD statement document.

The directive document is a metadata package of a collection. This document should be unique. All the possible keys in a directive document are presented in Table 4.2. The trust metric and provenance keys should impose a unique key constraint.

Definition 4.12 (Statement document). A statement document is the main part, which stores RDF statements with trust metrics.

Key          Description
_trust       a predefined value of collection trust
_prov        a predefined value of collection provenance
prefix ID    abbreviating IRIs

Table 4.2: RDFJD directive document keys

Example 4.11. In this example we present a directive document. The RDFJD document contains fields which define the provenance (http://example.org/g1) and trust (0.9) of the collection. It also defines a foaf prefix as an abbreviation for http://xmlns.com/foaf/0.1/.

{
  "_prov": "http://example.org/g1",
  "_trust": 0.9,
  "foaf": "http://xmlns.com/foaf/0.1/"
}

A statement document uses subject-centric syntax, and it represents one or more properties of a subject. Often these documents occur more than once. They implement the subject and trust metric as predefined keys, predicates as keys and objects as values.

Plain literals with a language tag and typed literals are supported by special predefined keys. All the possible keys in a statement document are presented in Table 4.3. A blank node is prefixed by an underscore and a dot. All prefixes of abbreviated IRIs use dot.

The RDFJD document grammar is defined in Appendix A.

Key              Description
_subject         Used to identify a subject that is being described
_type            Used to set the datatype of a subject
predicate key    Used to describe an object

Possible values of predicate key
_value           Used to specify the data that is associated with a particular predicate
_lang            Used to specify the native language for a particular object
_datatype        Used to specify the datatype for a particular object
_trust           Used to set the trust metric

Table 4.3: RDFJD statement document keys

Example 4.12. In this example we present a statement document. The key foaf.name is expanded to a value from a directive document (see Example 4.11). The RDFJD document contains fields that define two RDF quads:

1. quad 1:
   1.1 subject: http://example.com/voc#me,
   1.2 predicate: http://www.w3.org/1999/02/22-rdf-syntax-ns#type,
   1.3 object: http://example.com/voc#Teacher,
   1.4 trust metric: 0.9,
2. quad 2:
   2.1 subject: http://example.com/voc#me,
   2.2 predicate: http://xmlns.com/foaf/0.1/name,
   2.3 object: John Smith,
   2.4 trust metric: 0.5.

{
  "_subject": "http://example.com/voc#me",
  "_type": {
    "_value": "http://example.com/voc#Teacher",
    "_trust": 0.9
  },
  "foaf.name": {
    "_value": "John Smith",
    "_trust": 0.5
  }
}
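To make the relation between the two document types concrete, the following Python sketch reads the directive document of Example 4.11 and the statement document of Example 4.12 and recovers the two quads listed above. It is a simplified illustration, not the RDFJD grammar from Appendix A: the function names `expand_key` and `to_quads` are hypothetical, the `_lang` and `_datatype` keys are ignored, every predicate key is assumed to be abbreviated with a dot, and a statement without its own `_trust` is assumed to take the collection trust.

```python
from typing import Any, Dict, List, Tuple

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def expand_key(key: str, directive: Dict[str, Any]) -> str:
    """Expand an abbreviated key such as 'foaf.name' with the directive prefixes."""
    prefix, _, local = key.partition(".")
    return directive[prefix] + local


def to_quads(statement: Dict[str, Any],
             directive: Dict[str, Any]) -> List[Tuple[str, str, Any, float]]:
    """Turn one statement document into (subject, predicate, object, trust) quads."""
    subject = statement["_subject"]
    quads = []
    for key, value in statement.items():
        if key == "_subject":
            continue
        predicate = RDF_TYPE if key == "_type" else expand_key(key, directive)
        if isinstance(value, dict):
            obj = value["_value"]
            # Assumption: fall back to the collection trust when none is given.
            trust = value.get("_trust", directive["_trust"])
        else:
            obj, trust = value, directive["_trust"]
        quads.append((subject, predicate, obj, trust))
    return quads


directive = {
    "_prov": "http://example.org/g1",
    "_trust": 0.9,
    "foaf": "http://xmlns.com/foaf/0.1/",
}
statement = {
    "_subject": "http://example.com/voc#me",
    "_type": {"_value": "http://example.com/voc#Teacher", "_trust": 0.9},
    "foaf.name": {"_value": "John Smith", "_trust": 0.5},
}

quads = to_quads(statement, directive)
for quad in quads:
    print(quad)
```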

4.2.3. Generating Documents

In this subsection we propose two algorithms to generate an RDFJD directive document and a statement document.

Algorithm 4.3 presents the process of generating the RDFJD directive document. The algorithm determines the trust value and the provenance value for the data collection. The trust value is a summary of the trust metrics found in the RDF statements associated with existing agents; Algorithm 4.3 uses their median. The provenance value is built from the following components:

1. the protocol scheme supported by a graph store,
2. the authority (user information, host and port) supported by a graph store,
3. the unique path that serves to identify a collection.

input : set of agents A
output: directive document DD

create global metric gm;
foreach a ∈ A do
    get trust metric m from associated statements of a;
    add m to gm;
insert median(gm) into "_trust" key;
get store IRI scheme name sn;
get store IRI authority sa;
insert generate_unique_path(sn, sa) into "_prov" key;
insert predefined prefixes;

Algorithm 4.3: Generating a directive document
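The steps of Algorithm 4.3 can be sketched in Python as below. Several details are assumptions made only for illustration: each agent is represented by a single trust metric already extracted from its associated statements, the unique path is produced with a random UUID, and the name `generate_directive` and the agent IRIs are hypothetical.

```python
import uuid
from statistics import median
from typing import Dict, List


def generate_directive(agent_trust: Dict[str, float],
                       scheme: str,
                       authority: str,
                       prefixes: Dict[str, str]) -> Dict[str, object]:
    """Build a directive document from per-agent trust metrics."""
    # Collect the trust metric of every agent into the global metric gm.
    gm: List[float] = []
    for _agent, m in agent_trust.items():
        gm.append(m)

    directive: Dict[str, object] = {}
    # median() raises StatisticsError when no agent is given.
    directive["_trust"] = median(gm)
    # Assumption: a random UUID serves as the unique path of the collection.
    directive["_prov"] = f"{scheme}://{authority}/{uuid.uuid4().hex}"
    directive.update(prefixes)  # predefined prefixes
    return directive


dd = generate_directive(
    agent_trust={
        "http://example.org/agent1": 0.8,
        "http://example.org/agent2": 0.9,
        "http://example.org/agent3": 1.0,
    },
    scheme="http",
    authority="example.org",
    prefixes={"foaf": "http://xmlns.com/foaf/0.1/"},
)
print(dd["_trust"])
```

The median is robust to a single agent reporting an extreme trust value, which may be one reason to prefer it over the mean here.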

Algorithm 4.4 shows the process of generating an RDFJD statement document. The algorithm creates triples with trust metrics. The algorithm takes into account simple literals without a language tag, simple literals with a language tag and typed literals.

input : set of quads Q
output: set of statement documents SD

create root object;
foreach q ∈ Q do
    get subject s from q;
    insert s into "_subject" key;
    get predicate p from q;
    get object o from q;
    if equal(p, "rdf:type") then
        create "_type" key;
        insert o into "_type" key;
    else
        add prefix(p) to directive document;
        create key abbreviation(p);
        if o is literal without language tag then
            insert o into abbreviation(p) key;
        else if o is literal with language tag then
            create "_value" key in abbreviation(p) key;
            insert o into "_value" key;
            get language lg from o;
            create "_lang" key in abbreviation(p) key;
            insert lg into "_lang" key;
        else
            create "_value" key in abbreviation(p) key;
            insert o into "_value" key;
            get datatype dt from o;
            create "_datatype" key in abbreviation(p) key;
            insert dt into "_datatype" key;
    get trust m from q;
    insert m into "_trust" key;

Algorithm 4.4: Generating a statement document
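A possible Python rendering of Algorithm 4.4 is sketched below. It is not a literal transcription: to keep the output in the shape of Example 4.12, every object is wrapped in an entry with `_value` and `_trust`, one statement document is emitted per quad, and the prefixes are passed in rather than added to the directive document. The key names follow Table 4.3. The `Literal` class and the function names `abbreviate` and `statement_documents` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


@dataclass
class Literal:
    value: object
    lang: Optional[str] = None      # language tag, e.g. "en"
    datatype: Optional[str] = None  # datatype IRI of a typed literal


Quad = Tuple[str, str, Union[str, Literal], float]


def abbreviate(iri: str, prefixes: Dict[str, str]) -> str:
    """Rewrite an IRI as 'prefix.local' using the longest matching namespace."""
    matches = [(ns, name) for name, ns in prefixes.items() if iri.startswith(ns)]
    if not matches:
        return iri  # no known prefix: keep the full IRI as the key
    ns, name = max(matches, key=lambda m: len(m[0]))
    return f"{name}.{iri[len(ns):]}"


def statement_documents(quads: List[Quad], prefixes: Dict[str, str]) -> List[dict]:
    """Create one statement document per quad."""
    docs = []
    for s, p, o, trust in quads:
        key = "_type" if p == RDF_TYPE else abbreviate(p, prefixes)
        if isinstance(o, Literal):
            entry = {"_value": o.value}
            if o.lang is not None:
                entry["_lang"] = o.lang
            elif o.datatype is not None:
                entry["_datatype"] = o.datatype
        else:
            entry = {"_value": o}  # the object is an IRI
        entry["_trust"] = trust
        docs.append({"_subject": s, key: entry})
    return docs


me = "http://example.com/voc#me"
docs = statement_documents(
    [
        (me, RDF_TYPE, "http://example.com/voc#Teacher", 0.9),
        (me, "http://xmlns.com/foaf/0.1/name", Literal("John Smith"), 0.5),
    ],
    prefixes={"foaf": "http://xmlns.com/foaf/0.1/"},
)
for doc in docs:
    print(doc)
```

Because each quad yields its own document, the same subject appears twice in the result, which is the situation that the merging step addresses.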

The same subject may occur in different RDFJD statement documents (e.g. because new statements were inserted). To speed up subject-centric data retrieval, two or more statement documents with the same subject need to be merged. Algorithm 4.5 presents the process of merging RDFJD documents. After merging, an index [48] may be applied to the subject.

input : set of statement documents SD
output: statement document SD_M

SD_t ← sort(SD);
foreach s ∈ SD_t do
    if equal(current(), next()) then
        merge(current(), next());

Algorithm 4.5: Merging statement documents
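The sort-then-merge idea of Algorithm 4.5 can be sketched with `itertools.groupby`, which groups adjacent items of a sorted sequence. The sketch assumes that documents are compared by their `_subject` key and that, when two documents share a predicate key, the later one wins; the source does not specify this conflict rule. The function name `merge_statement_documents` is hypothetical.

```python
from itertools import groupby
from typing import Dict, List


def merge_statement_documents(docs: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Merge statement documents that share the same _subject."""
    # Sorting places documents with equal subjects next to each other.
    ordered = sorted(docs, key=lambda d: d["_subject"])
    merged = []
    for subject, group in groupby(ordered, key=lambda d: d["_subject"]):
        combined: Dict[str, object] = {"_subject": subject}
        for doc in group:
            for key, value in doc.items():
                if key != "_subject":
                    # Assumption: a later document overwrites a repeated key.
                    combined[key] = value
        merged.append(combined)
    return merged


me = "http://example.com/voc#me"
before = [
    {"_subject": me, "_type": {"_value": "http://example.com/voc#Teacher", "_trust": 0.9}},
    {"_subject": me, "foaf.name": {"_value": "John Smith", "_trust": 0.5}},
]
after = merge_statement_documents(before)
print(after)
```

Here two documents with one statement each become a single document with one `_subject` key and two statements.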

The first part of Example 4.13 presents two RDFJD statement documents. Both documents have the _subject key and one statement. Merging them yields a single document with one _subject key and two statements, which is presented in the second part of Example 4.13.

4.2.4. Mapping into Named Graphs

In this subsection the mapping from our approach to the named graph model [31] is presented. It is important because of compatibility with existing systems.

Collections C = ⟨r, ∅, G_T⟩ are equivalent to named graphs NG = ⟨n, G_T⟩, where n ∈ I is called the name. In the case where C = ⟨r, v, G_T⟩ we propose to use STO (see Subsection 4.1.4). Algorithm 4.6 presents the transformation into Named Graphs.

The RDF graph store can also be mapped to an RDF dataset. Following [61], an RDF dataset DS consists of one graph, called the default graph, which does not have a name, and zero or more named graphs, each identified by an IRI: DS = {G, ⟨u₁, G₁⟩, ⟨u₂, G₂⟩, . . . , ⟨u_i, G_i⟩}. If we define the document-oriented graph store as GS_D = {C₁, C₂, . . . , C_i} and C₁ = ⟨∅, ∅, G_T⟩, then DS is equivalent to GS_D. Otherwise, we suggest mapping from C_i to ⟨u_{i+1}, ∅, G_{i+1}⟩ and using Algorithm 4.6.

Example 4.13. In this part we present two RDFJD statement documents before normalization. All documents have the _subject key and one statement.

{ "_subject": "http://example.com/voc#me",

In this part we present the RDFJD statement documents after normalization. After merging these two documents there is a document with one _subject key and two statements.

input : collection C
output: named graph NG, default graph G_T

get r from C;
get v from C;
create NG with r as a name;
create default graph G_T;
foreach q ∈ C do
    get triple t from q;
    insert t into NG;

Algorithm 4.6: Mapping to Named Graphs
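The mapping of Algorithm 4.6 can be sketched in Python as follows. The sketch makes two assumptions that go beyond the pseudocode: the collection trust v is recorded in the default graph with the `sto:trustValue` property, as Example 4.14 suggests, and the per-statement trust values are dropped when quads are reduced to triples. The function name `map_to_named_graph` is hypothetical.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, object]
Quad = Tuple[str, str, object, float]


def map_to_named_graph(provenance: str,
                       trust: Optional[float],
                       quads: List[Quad]) -> Tuple[str, List[Triple], List[Triple]]:
    """Map a collection (r, v, quads) to a named graph and a default graph."""
    # The provenance r becomes the name of the graph; each quad loses its
    # fourth element and is inserted into the named graph as a triple.
    named_graph: List[Triple] = [(s, p, o) for (s, p, o, _t) in quads]

    # Assumption: the collection trust is stated about the graph name
    # in the default graph, using the STO property from Example 4.14.
    default_graph: List[Triple] = []
    if trust is not None:
        default_graph.append((provenance, "sto:trustValue", trust))
    return provenance, named_graph, default_graph


name, ng, dg = map_to_named_graph(
    "http://example.org/g1",
    0.5,
    [("http://example.com/voc#me", "http://xmlns.com/foaf/0.1/name", "John Smith", 0.5)],
)
print(name, ng, dg)
```

When the trust value is absent, the default graph stays empty, which corresponds to the case C = ⟨r, ∅, G_T⟩ that is directly equivalent to a named graph.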

Example 4.14. In this example we present mapping into a named graph. This example is provided in TriG [20].

metric:prov {
    <> foaf:name "John Smith" .
}
metric:prov sto:trustValue "0.5"^^xsd:float .

In the document: Politechnika Warszawska, Warsaw University of Technology (pages 58-69)