Load tests - Operations on RDF Graph Store

6. Operations on RDF Graph Store

6.1. Load tests

In this section we present the load tests that we performed using the Berlin SPARQL Benchmark (BSBM) [23] and discuss their results. Moreover, we compare our testbed to other graph stores and a relational database.

According to [23], BSBM is set in an e-commerce use case in which a set of products is offered by different vendors, and consumers have posted reviews about these products on various review sites. The benchmark defines an abstract data model for this use case together with data production rules that allow benchmark datasets to be scaled to arbitrary sizes using the number of products as a scale factor. We enrich BSBM with trust metrics, which are presented in Section 4.1.

The data model contains the following classes: Product, ProductType, ProductFeature, Producer, Vendor, Offer, Review, and Person. The most important instances are those of the Product class, which are described by the RDFS properties label and comment. Products have between 3 and 5 textual properties. These properties have property values ranging from 1 to 2000 with a normal distribution.

Products (Product class) have a type that is part of a type hierarchy (ProductType class). The depth of this subsumption hierarchy depends on the chosen scale factor.

The depth of the hierarchy is calculated as d = round(log10(n))/2 + 1, where n is the number of instances of Product class. Products have a variable number of product features (ProductFeature class). Two products that share the same product type share the same set of possible product features. Products are produced by producers (Producer class).
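The depth formula above can be evaluated directly. The following is a minimal sketch that applies the expression exactly as written; the function name `hierarchy_depth` is ours, and the text does not state how fractional results are turned into a whole number of levels, so the sketch leaves them as they are.

```python
import math


def hierarchy_depth(n_products: int) -> float:
    """Evaluate d = round(log10(n)) / 2 + 1 exactly as written in the text.

    Assumption: the result is left fractional, because the text does not say
    whether it is truncated or rounded to an integer number of levels.
    """
    if n_products < 1:
        raise ValueError("n_products must be a positive integer")
    return round(math.log10(n_products)) / 2 + 1


# Scale factors used here are illustrative powers of ten.
for n in (10, 100, 1000, 10000):
    print(n, hierarchy_depth(n))
```

For the scale factors in our load tests (10 to 100 products) the expression stays between 1.5 and 2, so the type hierarchy is shallow.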

The number of products per producer follows a normal distribution with a mean of µ = 50 and a standard deviation of σ = 16.6. Products are offered by vendors (Vendor class).

Offers (Offer class) are valid for a specific period and contain a price in the range [5; 10000] and the number of days it takes to deliver the product, in the range [1; 21]. Offers are distributed over products using a normal distribution with the parameters µ = n/2 and σ = n/4. The number of offers per vendor follows a normal distribution with the parameters µ = 2000 and σ = 667.

1 http://mongodb.org/

Reviews (Review class) consist of a title and a review text between 50 and 300 words.

Reviews have up to 4 ratings with a random integer value between 1 and 10.
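The production rules described above can be illustrated with a short sampling sketch. It is not the BSBM generator: the function names are ours, and the clamping of normal samples, the uniform choice of price, delivery days, and review length, and the reading of "up to 4 ratings" as 0 to 4 are assumptions made only for illustration.

```python
import random


def clamped_normal(rng, mu, sigma, low, high=None):
    """Draw a normal sample, round it to an integer, and clamp it to [low, high].

    Assumption: clamping is our choice; the text does not say how out-of-range
    samples are handled.
    """
    value = max(low, round(rng.gauss(mu, sigma)))
    return value if high is None else min(high, value)


def sample_rules(n_products, seed=0):
    """Sample one value for each production rule stated in the text."""
    rng = random.Random(seed)
    return {
        # Products per producer: mu = 50, sigma = 16.6.
        "products_per_producer": clamped_normal(rng, 50, 16.6, 1),
        # Offers per vendor: mu = 2000, sigma = 667.
        "offers_per_vendor": clamped_normal(rng, 2000, 667, 1),
        "offer": {
            # Index of the offered product: mu = n/2, sigma = n/4.
            "product": clamped_normal(rng, n_products / 2, n_products / 4, 1, n_products),
            # Assumption: price and delivery days are drawn uniformly.
            "price": rng.uniform(5, 10000),
            "delivery_days": rng.randint(1, 21),
        },
        "review": {
            "words": rng.randint(50, 300),
            # Assumption: "up to 4 ratings" is read as 0 to 4 ratings.
            "ratings": [rng.randint(1, 10) for _ in range(rng.randint(0, 4))],
        },
    }


print(sample_rules(n_products=100, seed=42))
```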

The load experiment measures the time required to load data into the testbed, a Virtuoso Open-Source Edition 6.1 graph store², and a MySQL 5.5 database³. The most important factor in the presented load tests is the number of statements, which are in the form of triples or quads. The datasets used in the experiment are described in Table 6.1.

No. of quads  No. of products  No. of features  No. of offers  No. of persons  No. of reviews
5007          10               289              200            6               100
8498          20               289              400            10              200
12022         30               289              600            16              300
16981         40               569              800            21              400
22729         50               999              1000           26              500
26263         60               999              1200           30              600
29847         70               999              1400           37              700
33354         80               999              1600           41              800
36879         90               999              1800           46              900
40377         100              999              2000           50              1000

Table 6.1: Load: dataset description

In order to compare our testbed not only to common graph stores but also to SQL databases, we define two concrete representations of the abstract model, both based on BSBM: an RDF representation and a relational representation. All plots show the arithmetic mean load time from 10 runs.
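The averaging over 10 runs can be sketched as follows. This is only an illustration of the measurement idea, not the harness used in the experiments: `mean_load_time` and `dummy_load` are hypothetical names, and `dummy_load` merely builds quads in memory instead of loading them into a store.

```python
import statistics
import time


def mean_load_time(load, runs=10):
    """Call `load()` `runs` times and return the arithmetic mean wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        load()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


def dummy_load():
    # Hypothetical stand-in for a real bulk load: it only builds 5007 quads in memory.
    return [(subject, "p", "o", "g") for subject in range(5007)]


print(f"mean load time: {mean_load_time(dummy_load):.6f} s")
```

In a real measurement, `dummy_load` would be replaced by a call that loads one dataset from Table 6.1 into an empty store.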

2 http://virtuoso.openlinksw.com/

3 http://www.mysql.com/

Figure 6.2: Load test: Testbed comparison to Virtuoso

We tested different serializations in the testbed and in Virtuoso. Figure 6.2 presents the results of the experiment, in which we load data into both the testbed (Fig. 6.2a) and Virtuoso (Fig. 6.2b). Fig. 6.2a compares the loading of two different serializations of the RDFJD document into the testbed. The normalized RDFJD is the faster one: loading it is approximately 1.8 times faster than loading unnormalized RDFJD. Fig. 6.2b compares the loading of two approaches into Virtuoso: basic RDF and trust-driven reified statements⁴, which use Algorithm 4.1. The results are not surprising: loading basic RDF is approximately 4.7 times faster than loading reified RDF. In Fig. 6.2a, the load time grows quickly with an increasing number of quads; in Fig. 6.2b, the growth rate is lower.

4 Both approaches use Turtle syntax.

Figure 6.3: Load test: Testbed comparison to Virtuoso and MySQL (log-log plot)

We also tested MySQL and compared the testbed to both Virtuoso and MySQL. Figure 6.3 presents the results of the experiment, in which we compare the testbed to Virtuoso (Fig. 6.3a) and MySQL (Fig. 6.3b). In Fig. 6.3a we show the loading of the normalized RDFJD serialization, which has the best load time in the testbed (see Fig. 6.2a), and the loading of basic RDF without a trust metric, which has the best load time in Virtuoso (see Fig. 6.2b). This plot shows that loading triples into Virtuoso is much faster than loading quads into the testbed. At 40000 statements, Virtuoso is up to 60 times faster. The testbed times grow nearly quadratically with the number of quads, with a coefficient of determination R² ≈ 0.99. The Virtuoso times grow nearly linearly with the number of triples, with R² ≈ 0.98.

In Fig. 6.3b we present the loading of the normalized RDFJD serialization into the testbed and the loading of records equivalent to the generated triples into MySQL⁵. This plot shows that loading data into MySQL is much faster than loading quads into the testbed. At 40000 statements, MySQL is up to 110 times faster. The MySQL times grow nearly linearly with the number of triples, with R² ≈ 0.94.
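One common way to check whether load times grow linearly or quadratically is a least-squares fit in log-log space, where the slope estimates the growth exponent and R² measures the quality of the fit. The sketch below illustrates this approach; it is not necessarily the procedure used to obtain the values reported here. The quad counts come from Table 6.1, while the times are synthetic values generated from t = c · n² and are not measurements.

```python
import math


def loglog_fit(sizes, times):
    """Least-squares fit of log10(t) = a + b * log10(n).

    Returns (b, a, r_squared); b estimates the growth exponent.
    """
    xs = [math.log10(n) for n in sizes]
    ys = [math.log10(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Quad counts from Table 6.1.
sizes = [5007, 8498, 12022, 16981, 22729, 26263, 29847, 33354, 36879, 40377]
# Synthetic times (t = c * n^2), NOT measured values; c is an arbitrary constant.
synthetic_times = [1e-7 * n ** 2 for n in sizes]

slope, intercept, r_squared = loglog_fit(sizes, synthetic_times)
print(f"estimated exponent: {slope:.2f}, R^2: {r_squared:.3f}")
```

An estimated exponent near 2 corresponds to nearly quadratic growth, and an exponent near 1 to nearly linear growth.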

5 Note that the measurement includes Data Definition Language times.

[Plot: x-axis: Number of quads; y-axis: Time (sec); series: textual RDFJD, binary RDFJD]

Figure 6.4: Load test: normalized textual RDFJD comparison to binary RDFJD (log-log plot)

Taking into consideration that the load times of textual RDFJD grow nearly quadratically, we propose a binary representation of RDFJD. Compared to textual RDFJD, binary RDFJD is designed to be efficient in both storage space and scan speed. Our proposal represents data types in little-endian format. Large elements are prefixed with a length field to facilitate scanning. Figure 6.4 presents the results of the experiment, in which we compare loading textual RDFJD and binary RDFJD into the testbed. This plot shows that loading quads in the binary serialization is much faster than loading quads in the textual serialization. Moreover, binary RDFJD scales better, because its load times grow nearly linearly, with R² = 0.97. At 40000 quads, loading binary RDFJD is up to 140 times faster than loading the textual version.
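The two design choices named above, little-endian values and length-prefixed elements, can be illustrated with a small sketch. The record layout below (four strings, each preceded by a 4-byte unsigned little-endian length) is hypothetical and chosen only for illustration; it is not the actual binary RDFJD format, which is not specified in this section.

```python
import struct


def encode_string(text: str) -> bytes:
    """Encode a string as a 4-byte little-endian length followed by its UTF-8 bytes."""
    data = text.encode("utf-8")
    return struct.pack("<I", len(data)) + data


def encode_quad(subject: str, predicate: str, obj: str, graph: str) -> bytes:
    """Hypothetical quad record: four length-prefixed strings in a row."""
    return b"".join(encode_string(term) for term in (subject, predicate, obj, graph))


def decode_quad(buffer: bytes, offset: int = 0):
    """Read one quad starting at `offset`; return the quad and the offset of the next record."""
    terms = []
    for _ in range(4):
        (length,) = struct.unpack_from("<I", buffer, offset)
        offset += 4
        terms.append(buffer[offset:offset + length].decode("utf-8"))
        offset += length
    return tuple(terms), offset


# Example terms are made up for this illustration.
record = encode_quad("ex:product1", "rdfs:label", "Example product", "ex:graph1")
quad, next_offset = decode_quad(record)
print(quad, next_offset)
```

Because every element carries its length, a reader can skip a term by advancing the offset without parsing its content, which is what makes scanning cheap compared to a textual serialization.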

Figure 6.5: Load test: Testbed with binary RDFJD comparison to Virtuoso and MySQL

We also compared the testbed with the binary serialization to Virtuoso and MySQL. Figure 6.5 presents the results of the experiment, in which we compare the testbed with the binary serialization to Virtuoso (Fig. 6.5a) and MySQL (Fig. 6.5b). In Fig. 6.5a we show the loading of the binary normalized RDFJD serialization into the testbed and the loading of both basic RDF without a trust metric and trust-driven reified RDF into Virtuoso. This plot shows that loading statements into the testbed is much faster than loading statements into Virtuoso. Loading the binary serialization into the testbed is approximately 2.4 times faster than loading RDF without trust metrics into Virtuoso. At 40000 statements, loading binary RDFJD into the testbed is up to 10 times faster than loading RDF with trust metrics into Virtuoso. In Fig. 6.5b we present the loading of the binary normalized RDFJD serialization into the testbed and the loading of records equivalent to the generated triples into MySQL. This plot shows that loading quads into the testbed is faster than loading equivalent records into the relational database: loading the binary serialization into the testbed is approximately 1.4 times faster than loading the records into MySQL.

The general observation from our experiments is that, in most cases, the measured times grow nearly linearly with the number of statements considered. Some of the times grow nearly quadratically with the number of quads. The best result belongs to the testbed with the binary serialization, despite the added trust metrics. Furthermore, one can see that the addition of trust metrics does not slow down the existing graph stores too much.
