ParlBench is a scalable RDF benchmark modelling a large-scale electronic publishing scenario. The benchmark suite consists of a set of data sets and a set of queries.
The benchmark data sets include (1) the Dutch parliamentary proceedings, (2) political parties, (3) politicians, (4) paragraphs with the textual content of the proceedings and (5) tagged entities that link the paragraphs to DBpedia. The data is real, but free of intellectual property rights issues.
The benchmark defines 19 analytical queries covering a wide range of SPARQL constructs. The queries can be viewed as coming from one of two use cases: creating a report or performing scientific research. To enable a more comprehensive analysis of the performance of RDF stores, the benchmark queries are grouped into four micro benchmarks according to their analytical aims: Average, Count, Factual and Top 10.
The benchmark data sets are described in more detail in the ParlBench benchmark description.
For the experiment we used an Apple MacBook Pro laptop with the following specification:
The benchmark was run on the OpenLink Virtuoso native RDF store (Open Source Edition). We used Virtuoso v06.01.3127 compiled from source for OS X, with Virtuoso's default RDF index scheme configuration, i.e., the scheme consisted of the following indices:
Loading time is the time it takes to load an RDF data set into an RDF store.
All the data sets are in RDF/XML representation. We measured the time in seconds, loading one data set from the test collection at a time. Members and parties are loaded entirely using the Virtuoso RDF bulk load procedure. Proceedings, paragraphs and tagged entities are loaded using DB.DBA.RDF_LOAD_RDFXML_MT, the Virtuoso function for loading large RDF/XML texts.
Query response time is the time it takes to execute a SPARQL query.
We measured the query response time in warm runs, i.e., after executing the benchmark queries several times on the Virtuoso server so that Virtuoso could cache the data in memory. We set the number of warm-up runs to 5. Each run consisted of the 10 permutations of the benchmark queries, i.e., 190 queries in total, so 950 queries were executed in the warm-up stage. After warming up Virtuoso, we ran the permutations 3 times and measured the average execution time of each query across these 3 runs. Each query was thus run 30 times in total.
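The query totals above follow directly from the run layout; a minimal sketch of the arithmetic, with the counts taken from the text:

```python
# Run layout as described in the text.
QUERIES_PER_PERMUTATION = 19   # all ParlBench queries
PERMUTATIONS_PER_RUN = 10
WARM_UP_RUNS = 5
MEASURED_RUNS = 3

queries_per_run = QUERIES_PER_PERMUTATION * PERMUTATIONS_PER_RUN       # queries in one run
warm_up_queries = WARM_UP_RUNS * queries_per_run                       # total warm-up queries
measured_executions_per_query = MEASURED_RUNS * PERMUTATIONS_PER_RUN   # timed runs per query

print(queries_per_run, warm_up_queries, measured_executions_per_query)  # 190 950 30
```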
For the experimental run, we generated a test collection that consists of the members and parties data sets and 1% of the proceedings, together with the corresponding paragraphs and tagged entities. Note that the members, parties and proceedings data sets are required for answering the benchmark queries.
| data set | members | parties | 1% of proceedings together with paragraphs and tagged entities |
|---|---|---|---|
| number of triples | 33,892 | 510 | 953,703 |
To generate the test collection and load it into Virtuoso, we used the script loadRDF.sh. The input to the script is the path to the benchmark data sets. For the script to run correctly, it is important to keep the folder structure as it is after unpacking the downloaded data sets.
The following command was used to run the script on the experimental machine:
$ ./loadRDF.sh ./input-data/

The following is the output of running the command above:
1 $ Do you want to load members? (y/n) y
2 $ Do you want to load parties? (y/n) y
3 $ Do you want to load proceedings? (y/n) y
4 $ Enter the number (integer) indicating what percent of proceedings you want to load? 1
5 $ Do you want to load paragraphs? (y/n) y
6 $ Do you want to load tagged entities? (y/n) y
7 $ Do you want to load proceedings? (y/n) y
Note that the number indicating what percent of the proceedings to load (line 4 of the output above) must be an integer.
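The integer constraint on the percent prompt could be enforced with a small input check; the following is a hypothetical sketch, not part of the actual loadRDF.sh:

```python
def parse_percent(raw):
    """Parse the percent-of-proceedings prompt; only integers in 1..100 are accepted."""
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"percent must be an integer, got {raw!r}")
    if not 1 <= value <= 100:
        raise ValueError(f"percent must be between 1 and 100, got {value}")
    return value

print(parse_percent("1"))  # 1
```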
We created 10 permutations of the benchmark queries. Each permutation consists of all 19 ParlBench queries.
To create the permutations we used the script createPermutations.sh. This script takes as input the number of permutations to be created. The script requires the folder queries with the benchmark queries to be in the same directory, and it creates the folder permutations in the current directory. The following is the command we ran to generate 10 permutations.
$ ./createPermutations.sh 10
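A permutation here is simply a random ordering of all 19 query files. A minimal sketch of what createPermutations.sh does, with hypothetical query file names:

```python
import random

# 19 ParlBench queries; the file names here are assumptions for illustration.
QUERIES = [f"q{i:02d}.sparql" for i in range(1, 20)]

def create_permutations(n):
    """Return n random orderings, each containing every benchmark query exactly once."""
    rng = random.Random(42)  # fixed seed so this sketch is reproducible
    perms = []
    for _ in range(n):
        order = QUERIES[:]
        rng.shuffle(order)
        perms.append(order)
    return perms

perms = create_permutations(10)
print(len(perms), all(sorted(p) == sorted(QUERIES) for p in perms))  # 10 True
```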
To run the queries, we used the script run-queries.sh. The script takes two input parameters: the number of runs of the permutations used to warm up, and the number of runs used to measure the average execution time of the queries. The script requires the folder permutations, created in the previous step, to be in the same directory. The following command was used to run the script run-queries.sh on the experimental machine:
$ ./run-queries.sh 5 3
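The measurement logic of run-queries.sh can be sketched as follows; execute_query is a placeholder for actually sending a SPARQL query to Virtuoso and waiting for the result:

```python
import time
from collections import defaultdict

def run_benchmark(permutations, execute_query, warm_up_runs=5, measured_runs=3):
    """Warm up the store, then average per-query response times over the measured runs."""
    # Warm-up stage: results are discarded; the store fills its caches.
    for _ in range(warm_up_runs):
        for perm in permutations:
            for query in perm:
                execute_query(query)
    # Measured stage: record every response time per query.
    timings = defaultdict(list)
    for _ in range(measured_runs):
        for perm in permutations:
            for query in perm:
                start = time.perf_counter()
                execute_query(query)
                timings[query].append(time.perf_counter() - start)
    return {q: sum(ts) / len(ts) for q, ts in timings.items()}
```

With 10 permutations of 19 queries and the parameters 5 and 3 used above, each query is timed 30 times, matching the setup described earlier.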
The script output is:
| data set | members | parties | 1% of proceedings | paragraphs | tagged entities |
|---|---|---|---|---|---|
| number of triples | 33,885 | 510 | 442,849 | 147,830 | 363,024 |
| loading time (sec) | 0.190 | 5.420 | 22.179 | 21854.030 | 2424.558 |
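As a quick sanity check, the three proceedings-related sets together should match the 953,703 triples reported for the test collection, and dividing triples by loading time makes the loading performance of the data sets comparable. A small sketch, with the numbers copied from the tables above:

```python
# Triple counts and loading times from the results table above.
results = {
    "members":           (33_885, 0.190),
    "parties":           (510, 5.420),
    "1% of proceedings": (442_849, 22.179),
    "paragraphs":        (147_830, 21_854.030),
    "tagged entities":   (363_024, 2_424.558),
}

# Load rate in triples per second for each data set.
for name, (triples, seconds) in results.items():
    print(f"{name}: {triples / seconds:,.1f} triples/sec")

# The three proceedings-related sets sum to the test collection size
# reported in the first table.
combined = 442_849 + 147_830 + 363_024
print(combined)  # 953703
```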