XML Prague
@ University of Economics, Prague
2024-06-08
Senior Engineer @ Evolved Binary
Databases, Concurrency, and Performance
Developer
Java
C++
XML and XQuery
Contributor to eXist-db since 2021
What does it look like now?
Taxonomy
History
How does it compare?
To SQL Benchmarking
To (non-XML) NoSQL Benchmarking?
What are we proposing?
A new "framework"
What can it do for us?
XML for document transformation
XSLT
Efficient transformation in the large
XML for structured data storage
XML Document Stores
Documents stored unprocessed
XML Databases
Documents decomposed
XQuery
Query databases
Query documents
The Initial XML Benchmarking Effort
An XML version of the OO7 benchmark
OODBMS benchmark
XML as an alternative to SQL databases
Predates the XQuery standard
Different DBs had different query languages
Problems translating queries for each system
XMark: an application-level benchmark
Characterising the broad performance of a whole system
Exercises as many features as possible
Scalable workload generation
xmlgen tool (C89)
Generate an auction database
Structured/linked elements
Text-heavy elements
Numbers of components scaled by benchmark size
Queries
Fixed set of queries
Allows consistent comparison of XML DBs/stores
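The idea of scaling component counts by benchmark size can be sketched in a few lines of Java. This is only an illustration: the base counts below are hypothetical placeholders, not xmlgen's real constants, and the class and method names are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ScaledCounts {
    // Hypothetical base counts at scale factor 1.0 (placeholders, not xmlgen's values).
    static final Map<String, Long> BASE = new LinkedHashMap<>();
    static {
        BASE.put("person", 1000L);
        BASE.put("item", 800L);
        BASE.put("open_auction", 500L);
    }

    // Multiply every component count by the scale factor, keeping at least one of each.
    static Map<String, Long> scale(double factor) {
        Map<String, Long> scaled = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : BASE.entrySet()) {
            scaled.put(e.getKey(), Math.max(1L, Math.round(e.getValue() * factor)));
        }
        return scaled;
    }

    public static void main(String[] args) {
        System.out.println(scale(10.0));   // ten times the base document
        System.out.println(scale(0.001));  // tiny document, one of each component
    }
}
```

Because the query set is fixed and only the counts change, the same queries can be run against documents of very different sizes.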
A microbenchmark
Focused on isolating detailed aspects of performance
Targeted for use as a developer tool
Compared 3 databases that support XML
Commercial XML DB
Commercial ORDBMS
Timber DB, then under development at the University of Michigan
Identified several performance issues for development
A standardised environment for XML DB benchmarking
Aim for objective comparison of multiple systems
Configure preloaded workloads and queries
Invoke an experiment via Web API
A known workload and query set on a specific back end
E.g. run XMark at scale 10 on eXist-db
Record results of experiments
Extensible
New workloads, queries, experiments
All experimental results recorded
Extensible Stylesheet Language Transformations
Shares use cases with XQuery
XPath in common
Processing model
Take part(s) of input document(s)
Transform them
Create output document(s)
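The three steps of the processing model can be seen in a minimal sketch using the XSLT 1.0 processor bundled with the JDK (javax.xml.transform). The input document, the stylesheet and the class name are all made up for illustration; a real XSLT benchmark would use far larger documents and typically an XSLT 2.0 or 3.0 processor.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;

final class XsltModel {
    // Hypothetical input document.
    static final String INPUT =
        "<people><person><name>Ada</name></person><person><name>Alan</name></person></people>";

    // Select the name elements (take parts), rewrite each as <n> (transform),
    // and wrap them in a new <names> root (create output).
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<names><xsl:apply-templates select='people/person/name'/></names>"
      + "</xsl:template>"
      + "<xsl:template match='name'><n><xsl:value-of select='.'/></n></xsl:template>"
      + "</xsl:stylesheet>";

    static String transform(String xml, String xslt) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(INPUT, STYLESHEET));
    }
}
```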
XT-Speedo (2014)
An XSLT benchmark
XSLT-focused workloads differ from XQuery workloads
Partly by user choice
Pragmatic collection of XSLT-focused documents, stylesheets
Stylesheet compilation is a factor
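One way to see why stylesheet compilation matters is to time it separately from the transformations themselves. The sketch below does this with the JDK's built-in javax.xml.transform API, compiling once into a Templates object and reusing it. The stylesheet, the input and the class name are hypothetical, and this is not how XT-Speedo itself measures anything.

```java
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;

final class CompiledStylesheet {
    // Hypothetical stylesheet: copy the text of <doc> into <out>.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/doc'><out><xsl:value-of select='.'/></out></xsl:template>"
      + "</xsl:stylesheet>";

    // Compile the stylesheet once; the resulting Templates object can be reused.
    static Templates compile(String xslt) {
        try {
            return TransformerFactory.newInstance()
                    .newTemplates(new StreamSource(new StringReader(xslt)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Run one transformation using an already compiled stylesheet.
    static String apply(Templates compiled, String xml) {
        try {
            StringWriter out = new StringWriter();
            compiled.newTransformer()
                    .transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        Templates compiled = compile(STYLESHEET);
        long t1 = System.nanoTime();
        for (int i = 0; i < 1_000; i++) {
            apply(compiled, "<doc>run " + i + "</doc>");
        }
        long t2 = System.nanoTime();
        System.out.printf("compile: %d us, 1000 transforms: %d us%n",
                (t1 - t0) / 1_000, (t2 - t1) / 1_000);
    }
}
```

A benchmark that recompiles the stylesheet on every run measures something different from one that compiles once, so the two costs are worth reporting separately.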
The Concept
Benchmark Framework or Testbed
OLTP-Bench (2013)
Testbed (their term) for SQL Databases
Goals
Driving relational DBMSs via standard interfaces
Tightly controlling the transaction mixture, request rate, and access distribution of the workload
Automatically gathering a rich set of performance and resource statistics
Synthetic and Real Data & Workloads
Mixed and Evolving Workloads
Change the rate, composition, and access distribution of workloads dynamically
Simulate real-life events and evolving workloads
Fine-Grained Rate Control
Control request rates with precision
Flexible Workload Generation
Open, closed, and semi-open loop systems
Transactional Throughput Scalability
Without being restricted by clients
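The difference between closed-loop and open-loop workload generation can be shown with a small deterministic model. This is a simplified sketch with invented names, not OLTP-Bench's implementation: it only computes when each request would be issued.

```java
import java.util.Arrays;

final class LoopModels {
    // Closed loop: a single client issues the next request only after the
    // previous one has completed, so slow responses reduce the request rate.
    static long[] closedLoopStarts(long[] serviceNanos) {
        long[] starts = new long[serviceNanos.length];
        long clock = 0;
        for (int i = 0; i < serviceNanos.length; i++) {
            starts[i] = clock;
            clock += serviceNanos[i];
        }
        return starts;
    }

    // Open loop: requests are issued on a fixed schedule derived from the
    // target rate, whether or not earlier requests have completed.
    static long[] openLoopStarts(int count, long opsPerSecond) {
        long[] starts = new long[count];
        for (int i = 0; i < count; i++) {
            starts[i] = i * 1_000_000_000L / opsPerSecond;
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(closedLoopStarts(new long[] {5, 10, 5})));
        System.out.println(Arrays.toString(openLoopStarts(3, 1_000)));
    }
}
```

In the closed-loop model the issue times depend on the system under test; in the open-loop model they depend only on the configured rate, which is what makes precise rate control possible.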
NoSQLBench: a testbed for NoSQL database benchmarking
Pluggable architecture
Adapters for CQL, MongoDB, DynamoDB, etc.
Concurrent workload dispatch
Scriptable workloads (YAML)
Define operation templates
Define distribution of operation dispatch
Virtual Data Sets (VDS) for op field values
Rich set of functional generators
Value a function of the operation cycle number
Efficient generation and substitution into ops
Repeatable
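The principle that each field value is a pure function of the operation cycle number, and is therefore repeatable, can be sketched with plain java.util.function types. This does not use the NoSQLBench library: the mixing function is the well-known SplitMix64 finalizer, and the class and helper names are invented for the example.

```java
import java.util.function.LongFunction;

final class CycleFunctions {
    // SplitMix64-style mixing: spreads consecutive cycle numbers over the long range.
    static long mix(long z) {
        z += 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // A value generator: the same cycle number always selects the same choice.
    static LongFunction<String> pick(String... choices) {
        return cycle -> choices[(int) Math.floorMod(mix(cycle), (long) choices.length)];
    }

    public static void main(String[] args) {
        LongFunction<String> city = pick("Prague", "Brno", "Ostrava");
        // Composition: a second field derived from the same cycle number.
        LongFunction<String> label = cycle -> "user-" + cycle + "@" + city.apply(cycle);
        for (long cycle = 0; cycle < 3; cycle++) {
            System.out.println(label.apply(cycle));
        }
    }
}
```

Because no state is kept between calls, any cycle can be regenerated on any thread or machine and will yield the same value, which is the property that makes a generated workload repeatable.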
XML:DB API Adapter
Drive XML Database(s) from NoSQLBench
YAML scripted XML tests
VDS-based XML Generation
Flexible synthetic-XML generation in Java 21
Reproduced XMark's xmlgen
Canonical auction site example
Scalable
Currently about 1.9x the generation time of the C89 (ANSI C) version (420s vs 221s at scale factor 100)
Scope for improvement
scenarios:
  default:
    # single shot operation to clear / reset the database contents
    # elemental driver requires an endpoint parameter
    schema: >
      run driver=elemental
      tags==block:"schema.*"
      threads==1
      cycles==UNDEF
      endpoint=xmldb://localhost:3808
    # different ops in the "write" schema occur on intervals, at the ratios declared by the ops
    # elemental driver requires an endpoint parameter
    write: >
      run driver=elemental
      tags==block:"write.*",
      cycles===TEMPLATE(write-cycles,TEMPLATE(docscount,1000))
      seq=interval
      threads=auto
      errors=timer,warn
      endpoint=xmldb://localhost:3808
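The comment about ops occurring "on intervals, at the ratios declared by the ops" can be illustrated with a small sketch of interval-style sequencing. This is one reading of the idea, written from scratch with invented names; NoSQLBench's own sequencer may differ in detail. Each op with ratio r is placed at positions 0/r, 1/r, ... on a unit interval, and the ops are then ordered by position, with ties broken by declaration order.

```java
import java.util.ArrayList;
import java.util.List;

final class IntervalSequence {
    // One occurrence of an op at position index/ratio on the unit interval.
    private record Slot(String op, long index, long ratio) {}

    static String sequence(String[] ops, long[] ratios) {
        List<Slot> slots = new ArrayList<>();
        for (int o = 0; o < ops.length; o++) {
            for (long i = 0; i < ratios[o]; i++) {
                slots.add(new Slot(ops[o], i, ratios[o]));
            }
        }
        // Compare positions exactly by cross-multiplication (a.index/a.ratio vs b.index/b.ratio).
        // List.sort is stable, so ops at the same position keep their declaration order.
        slots.sort((a, b) -> Long.compare(a.index() * b.ratio(), b.index() * a.ratio()));
        StringBuilder order = new StringBuilder();
        for (Slot s : slots) {
            order.append(s.op());
        }
        return order.toString();
    }

    public static void main(String[] args) {
        // Three hypothetical ops A, B, C declared with ratios 4, 2 and 1.
        System.out.println(sequence(new String[] {"A", "B", "C"}, new long[] {4, 2, 1}));
    }
}
```

With ratios 4:2:1 the sketch yields a seven-op sequence in which the less frequent ops are spread out rather than bunched at the end.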
# Use example built-in VDS functions
user_id: ToHashedUUID(); ToString() -> String
created_on: Uniform(1262304000,1577836800) -> long
gender: WeightedStrings('M:10;F:10;O:1')
city: Cities()
# We added our own function to select a paragraph of a Gutenberg book
text: WarAndPeace()
# Create a full name by joining first and last names
full_name: ListSized(FixedValue(2), FirstNames(), LastNames()); Join(' ')
# Insert a pseudo-random person record
op: |
  xquery version "3.1";
  ...
  let $record :=
    <person id="{user_id}" seq="{seq_key}">
      <name>{full_name}</name>
      <text>{alttext}</text>
    </person>
  return
    xmldb:store("/{collection}/testnb/beta", "{random_key}", $record)
Use the VDS part of NoSQLBench as a Java library
Declare bindings in code
Syntactic sugar around XML generation
Emit an element with value generated by the VDS function
LongFunction<String> fullNames = (LongFunction<String>) new Flow(
    new ListSized(
        new FixedValue(2),
        new FirstNames(),
        new LastNames()
    ), new Join<String>(" "));
LongFunction<String> education = new WeightedStrings(
    "Other: 3, High School: 10, College: 10, Graduate School: 4");
element("name", fullNames);
element("education", education);
Compose new VDS functions
LongFunction<String> emailNames = new AppendList<>(new ListFunctions(
    new FirstNames(),
    new FixedValue(" "),
    new LastNames(),
    new FixedValue(" mailto:"),
    new LastNames(),
    new FixedValue("@"),
    new AlphaNumericString(4),
    new FixedValue("."),
    new WeightedStrings("com:10;org:8;co.uk:1")), "");
Parameterise VDS functions from configuration
var openAuctions = configuration.element(Configuration.Elements.OpenAuctions);
this.auctionRef = new ListSizedStepped(
    new Zipf(10, 1.75),
    new HashRange(openAuctions.from(), openAuctions.to()));
Create nested elements with lambdas
element("mailbox", this::mailbox);
protected void mailbox() {
    var num = next(emailCount);
    for (int i = 0; i < num; i++) {
        element("mail", () -> {
            element("from", emailNames);
            element("to", emailNames);
            element("date", Formatters.dateFmt.format(next(emailTime)));
            text.build();
        });
    }
}
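How an element(...) helper with lambda bodies might work can be shown with a self-contained sketch. This is not the xmlgen2 implementation: the class, its two overloads and the cycle counter are assumptions made for illustration, and only standard JDK types are used.

```java
import java.util.function.LongFunction;

final class XmlSketch {
    private final StringBuilder out = new StringBuilder();
    private long cycle = 0;

    // Emit <name>value</name>, drawing the value from a function of the next cycle number.
    void element(String name, LongFunction<String> value) {
        out.append('<').append(name).append('>')
           .append(escape(value.apply(cycle++)))
           .append("</").append(name).append('>');
    }

    // Emit <name>...</name>, where the body is produced by nested element(...) calls.
    void element(String name, Runnable body) {
        out.append('<').append(name).append('>');
        body.run();
        out.append("</").append(name).append('>');
    }

    // Escape the characters that are not allowed in XML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    String build() {
        return out.toString();
    }

    // Build a mailbox with two mails, each with generated from/to values.
    static String demo() {
        XmlSketch xml = new XmlSketch();
        LongFunction<String> names = c -> "user" + c;
        xml.element("mailbox", () -> {
            for (int i = 0; i < 2; i++) {
                xml.element("mail", () -> {
                    xml.element("from", names);
                    xml.element("to", names);
                });
            }
        });
        return xml.build();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The nesting of the lambdas mirrors the nesting of the output elements, so the generator code reads much like the document it produces.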
Scale factor 100
A large data set
Apple MacBook Pro, M1 Max CPU, 64GB RAM
xmlgen (C89 / ANSI C)
221s generation time
11,758MB XML file
xmlgen2 (Java 21, VDS library)
420s generation time
13,235MB XML file
Single threaded
Lots of room for performance improvements
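Since the two generators produce files of different sizes, it helps to compare throughput as well as elapsed time. The short sketch below derives both from the figures quoted above; the class and method names are invented for the example.

```java
import java.util.Locale;

final class GenThroughput {
    // Generation throughput in MB per second.
    static double mbPerSecond(double megabytes, double seconds) {
        return megabytes / seconds;
    }

    public static void main(String[] args) {
        double c89 = mbPerSecond(11_758, 221);   // xmlgen (C89 / ANSI C)
        double java21 = mbPerSecond(13_235, 420); // xmlgen2 (Java 21, VDS library)
        System.out.println(String.format(Locale.ROOT,
                "xmlgen: %.1f MB/s, xmlgen2: %.1f MB/s", c89, java21));
        System.out.println(String.format(Locale.ROOT,
                "elapsed time ratio: %.2f, throughput ratio: %.2f",
                420.0 / 221.0, java21 / c89));
    }
}
```

On these numbers xmlgen2 takes a little under twice as long and writes at roughly 60% of xmlgen's throughput.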
Round trip test
Under NoSQLBench control (YAML script)
Generate XML
Load it into an XML:DB API database
Perform a set of simple benchmark queries
YAML Scripts for common benchmarks
Reproducible
Distributable
Automatable
NoSQLBench (our fork with XML:DB API driver):
https://github.com/evolvedbinary/nosqlbench