Alan Paxton and Adam Retter

tech@evolvedbinary.com

XML Prague
@ University of Economics, Prague
2024-06-08

github.com/evolvedbinary

Modern Benchmarking of XQuery and XML Databases

Alan Paxton

alan@evolvedbinary.com
@alanpaxton

Senior Engineer @Evolved Binary
Databases, Concurrency, and Performance
Developer
- Java
- C++
- XML and XQuery
Contributor to eXist-db since 2021

XML Benchmarking

What does it look like now?
- Taxonomy
- History
How does it compare?
- To SQL Benchmarking
- To (non-XML) NoSQL Benchmarking?
What are we proposing?
- A new "framework"
- What can it do for us?

XML Benchmarking Taxonomy

XML for document transformation
- XSLT
  - Efficient transformation in the large
XML for structured data storage
- XML Document Stores
  - Documents stored unprocessed
- XML Databases
  - Documents decomposed
XQuery
- Query databases
- Query documents

The Growth of XML

XML Benchmarking History

XOO7 (2001)

The Initial XML Benchmarking Effort
An XML version of the OO7 benchmark
- OODBMS benchmark
XML as an alternative to SQL databases
Predates the XQuery standard
- Different DBs had different query languages
- Problems translating queries for each system

XMark - The XML Benchmark Project (2002)

An application-level benchmark
- Characterising the broad performance of a whole system
- Exercises as many features as possible
Scalable workload generation
- xmlgen tool (C89)
- Generate an auction database
  - Structured/linked elements
  - Text-heavy elements
  - Numbers of components scaled by benchmark size
Queries
- Fixed set of queries
- Allows consistent comparison of XML DBs/stores

Michigan Benchmark (2006)

A microbenchmark
- Focused on isolating detailed aspects of performance
- Intended for use as a developer tool
Compared 3 databases that support XML
- Commercial XML DB
- Commercial ORDBMS
- Timber DB being developed at Michigan
  - Identified several performance issues for development

XQBench (2011)

A standardised environment for XML DB benchmarking
Aim for objective comparison of multiple systems
Configure preloaded workloads and queries
Invoke an experiment via Web API
- A known workload and query set on a specific back end
- E.g. run XMark at scale 10 on eXist-db
Record the results of all experiments
Extensible
- New workloads, queries, experiments

XSLT Benchmarking

Extensible Stylesheet Language Transformations
- Shares use cases with XQuery
  - XPath in common
- Processing model
  - Take part(s) of input document(s)
  - Transform them
  - Create output document(s)
XT-Speedo (2014)
- An XSLT benchmark
- XSLT-focused workloads differ from XQuery workloads
  - Partly by user choice
  - Pragmatic collection of XSLT-focused documents, stylesheets
- Stylesheet compilation is a factor
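
The processing model and the compilation point above can be sketched with the JDK's built-in `javax.xml.transform` API. This is only an illustration, not how XT-Speedo measures anything: the stylesheet, the input document, and the class name `XsltTimingSketch` are all made up for the example. It compiles the stylesheet once into a `Templates` object and times that step separately from the transformation itself.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltTimingSketch {

    // Hypothetical stylesheet: flattens each person's name into a <names> list.
    static final String STYLESHEET = """
        <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
          <xsl:output method="xml" omit-xml-declaration="yes" indent="no"/>
          <xsl:template match="/people">
            <names>
              <xsl:for-each select="person">
                <name><xsl:value-of select="name"/></name>
              </xsl:for-each>
            </names>
          </xsl:template>
        </xsl:stylesheet>
        """;

    // Hypothetical input document.
    static final String INPUT =
        "<people><person><name>Ada</name></person><person><name>Bob</name></person></people>";

    /** Compiles the stylesheet once; the result can be reused for many transformations. */
    static Templates compile(String stylesheet) {
        try {
            return TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(stylesheet)));
        } catch (TransformerException e) {
            throw new IllegalStateException("stylesheet did not compile", e);
        }
    }

    /** Takes an input document, transforms it, and returns the output document as a string. */
    static String transform(Templates compiled, String inputXml) {
        try {
            Transformer transformer = compiled.newTransformer();
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(inputXml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        Templates compiled = compile(STYLESHEET);
        long t1 = System.nanoTime();
        String result = transform(compiled, INPUT);
        long t2 = System.nanoTime();
        System.out.println(result);
        System.out.printf("compile: %d us, transform: %d us%n", (t1 - t0) / 1000, (t2 - t1) / 1000);
    }
}
```

Whether a benchmark reports the compile step, the transform step, or both is a design choice; a single cold run like this one is also dominated by JVM warm-up, so real measurements would need repeated runs.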

SQL Benchmark Frameworks

The Concept
- Benchmark Framework or Testbed
OLTP-Bench (2013)
- Testbed (their term) for SQL Databases
Goals
1. Driving relational DBMSs via standard interfaces
2. Tightly controlling the transaction mixture, request rate, and access distribution of the workload
3. Automatically gathering a rich set of performance and resource statistics

SQL Testbed Requirements

Synthetic and Real Data & Workloads
Mixed and Evolving Workloads
- Change the rate, composition, and access distribution of workloads dynamically
- Simulate real-life events and evolving workloads
Fine-Grained Rate Control
- Control request rates with precision
Flexible Workload Generation
- Open, closed, and semi-open loop systems
Transactional Throughput Scalability
- Without being restricted by clients

From OLTP-Bench

NoSQLBench (2020)

Testbed for NoSQL Database benchmarking
Pluggable architecture
- Adapters for CQL, MongoDB, DynamoDB, etc.
Concurrent workload dispatch
Scriptable workloads (YAML)
- Define operation templates
- Define distribution of operation dispatch
Virtual Data Sets (VDS) for op field values
- Rich set of functional generators
  - Value a function of the operation cycle number
  - Efficient generation and substitution into ops
  - Repeatable
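
The idea that a value is a pure function of the operation cycle number can be sketched in plain Java. This is not the NoSQLBench implementation; the class `CycleFunctionSketch`, the name lists, and the choice of a SplitMix64-style mixing step as the hash are all assumptions made for illustration.

```java
import java.util.List;
import java.util.function.LongFunction;

class CycleFunctionSketch {

    // Hypothetical value lists; a real data set would be far larger.
    static final List<String> FIRST = List.of("Ada", "Grace", "Alan", "Edsger");
    static final List<String> LAST = List.of("Lovelace", "Hopper", "Turing", "Dijkstra");

    /** SplitMix64-style bit mixer, used here as a stand-in hash of the cycle number. */
    static long mix(long z) {
        z += 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    /** Picks a list entry from the hashed cycle; the salt decorrelates different fields. */
    static LongFunction<String> pick(List<String> values, long salt) {
        return cycle -> values.get((int) Long.remainderUnsigned(mix(cycle ^ mix(salt)), values.size()));
    }

    /** Composes two cycle functions into one, joined by a separator. */
    static LongFunction<String> join(String separator, LongFunction<String> a, LongFunction<String> b) {
        return cycle -> a.apply(cycle) + separator + b.apply(cycle);
    }

    static LongFunction<String> fullName() {
        return join(" ", pick(FIRST, 1), pick(LAST, 2));
    }

    public static void main(String[] args) {
        LongFunction<String> names = fullName();
        for (long cycle = 0; cycle < 3; cycle++) {
            // No state is kept between calls: the same cycle always yields the same name.
            System.out.println(cycle + " -> " + names.apply(cycle));
        }
    }
}
```

Because nothing is stored between calls, any thread can generate the value for any cycle independently, and a rerun reproduces the same data set.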

XQuery and XML Database Benchmarking with NoSQLBench

NoSQLBench - XML Extensions

XML:DB API Adapter
- Drive XML Database(s) from NoSQLBench
  - YAML scripted XML tests
VDS-based XML Generation
- Flexible synthetic-XML generation in Java 21
- Reproduced XMark's xmlgen
  - Canonical auction site example
  - Scalable
- Currently runs at about half the speed of the C89 (ANSI C) version
  - Scope for improvement

Workload Configuration

scenarios:
  default:
    # single-shot operation to clear / reset the database contents
    # the elemental driver requires an endpoint parameter
    schema: >
      run driver=elemental
      tags==block:"schema.*"
      threads==1
      cycles==UNDEF
      endpoint=xmldb://localhost:3808

    # different ops in the "write" block occur on intervals, at the ratios declared by the ops
    # the elemental driver requires an endpoint parameter
    write: >
      run driver=elemental
      tags==block:"write.*"
      cycles===TEMPLATE(write-cycles,TEMPLATE(docscount,1000))
      seq=interval
      threads=auto
      errors=timer,warn
      endpoint=xmldb://localhost:3808

Configuration Bindings

# Use example built-in VDS functions
user_id: ToHashedUUID(); ToString() -> String
created_on: Uniform(1262304000,1577836800) -> long
gender: WeightedStrings('M:10;F:10;O:1')
city: Cities()

# We added our own function to select a paragraph of a Gutenberg book
text: WarAndPeace()

# Create a full name by joining first and last names
full_name: ListSized(FixedValue(2), FirstNames(), LastNames()); Join(' ')

XML:DB API Adapter op

# Insert a pseudo-random person record
op: |
  xquery version "3.1";

  ...

  let $record :=
      <person id="{user_id}" seq="{seq_key}">
        <name>{full_name}</name>
        <text>{alttext}</text>
      </person>
  return
    xmldb:store("/{collection}/testnb/beta", "{random_key}", $record)
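
How bind points such as `{user_id}` in an op template get filled for each cycle can be sketched as a simple substitution. This is an illustration only, not the adapter's actual code: the class `OpTemplateSketch`, its `render` method, and the bindings used in `main` are hypothetical, and the sketch does no XML escaping of the substituted values.

```java
import java.util.Map;
import java.util.function.LongFunction;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OpTemplateSketch {

    // A bind point is a name in braces, e.g. {user_id}.
    private static final Pattern BIND_POINT = Pattern.compile("\\{(\\w+)\\}");

    /** Replaces each known bind point with its binding's value for this cycle; unknown names are left as-is. */
    static String render(String template, Map<String, LongFunction<String>> bindings, long cycle) {
        Matcher matcher = BIND_POINT.matcher(template);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            LongFunction<String> binding = bindings.get(matcher.group(1));
            String value = (binding == null) ? matcher.group() : binding.apply(cycle);
            // quoteReplacement stops '$' and '\' in the value being treated as group references.
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical bindings standing in for the VDS functions.
        Map<String, LongFunction<String>> bindings = Map.of(
            "user_id", cycle -> "u" + cycle,
            "full_name", cycle -> "Ada Lovelace");
        String template = "<person id=\"{user_id}\"><name>{full_name}</name></person>";
        System.out.println(render(template, bindings, 7));
    }
}
```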

XML Generation

Use the VDS part of NoSQLBench as a Java library
- Declare bindings in code

Syntactic sugar around XML generation
- Emit an element with value generated by the VDS function

Virtual Data Set Based Tool

LongFunction<String> fullNames = (LongFunction<String>) new Flow(
    new ListSized(
        new FixedValue(2),
        new FirstNames(),
        new LastNames()
    ), new Join<String>(" "));

LongFunction<String> education = new WeightedStrings(
    "Other: 3, High School: 10, College: 10, Graduate School: 4");

element("name", fullNames);

element("education", education);

NoSQLBench XML Generation

Compose new VDS functions

LongFunction<String> emailNames = new AppendList<>(new ListFunctions(
    new FirstNames(),
    new FixedValue(" "),
    new LastNames(),
    new FixedValue(" mailto:"),
    new LastNames(),
    new FixedValue("@"),
    new AlphaNumericString(4),
    new FixedValue("."),
    new WeightedStrings("com:10;org:8;co.uk:1")), "");

NoSQLBench XML Generation

Parameterise VDS functions from configuration

var openAuctions = configuration.element(Configuration.Elements.OpenAuctions);

this.auctionRef = new ListSizedStepped(
    new Zipf(10, 1.75),
    new HashRange(openAuctions.from(), openAuctions.to()));
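
To give a feel for what a Zipf-distributed size looks like, here is a minimal inverse-CDF sketch. It assumes that `Zipf(10, 1.75)` means ranks 1 to 10 with exponent 1.75, which the slide does not spell out; the class `ZipfSketch` and its method are hypothetical and are not the xmlgen2 implementation.

```java
class ZipfSketch {

    /**
     * Maps a uniform value u in [0, 1) to a rank in 1..n, where rank k has
     * probability proportional to 1 / k^exponent.
     */
    static int zipfRank(int n, double exponent, double u) {
        double norm = 0.0;
        for (int k = 1; k <= n; k++) {
            norm += 1.0 / Math.pow(k, exponent);
        }
        double cumulative = 0.0;
        for (int k = 1; k <= n; k++) {
            cumulative += (1.0 / Math.pow(k, exponent)) / norm;
            if (u < cumulative) {
                return k;
            }
        }
        return n; // guards against floating-point rounding when u is very close to 1
    }

    public static void main(String[] args) {
        // Small ranks dominate: with these parameters rank 1 covers more than half of [0, 1).
        for (double u : new double[] {0.0, 0.5, 0.6, 0.9, 0.999}) {
            System.out.println(u + " -> " + zipfRank(10, 1.75, u));
        }
    }
}
```

Feeding `u` from a hash of the cycle number, rather than from a random generator, would keep the result repeatable in the same way as the other VDS functions.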

NoSQLBench XML Generation

Create nested elements with lambdas

element("mailbox", this::mailbox);

protected void mailbox() {
  var num = next(emailCount);
  for (int i = 0; i < num; i++) {
    element("mail", () -> {
        element("from", emailNames);
        element("to", emailNames);
        element("date", Formatters.dateFmt.format(next(emailTime)));
        text.build();
    });
  }
}
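
The `element(...)` sugar used above can be approximated in a few lines. This is a sketch of the general technique under assumptions, not the xmlgen2 source: `XmlEmitterSketch` is a made-up class that writes to an in-memory `StringBuilder`, takes plain strings where the real tool takes VDS functions, and ignores attributes.

```java
class XmlEmitterSketch {

    private final StringBuilder out = new StringBuilder();

    /** Escapes the characters that are not allowed verbatim in element text. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Emits a leaf element with escaped text content. */
    void element(String name, String text) {
        out.append('<').append(name).append('>')
           .append(escape(text))
           .append("</").append(name).append('>');
    }

    /** Emits an element whose children are produced by running the given lambda. */
    void element(String name, Runnable children) {
        out.append('<').append(name).append('>');
        children.run();
        out.append("</").append(name).append('>');
    }

    String result() {
        return out.toString();
    }

    public static void main(String[] args) {
        XmlEmitterSketch xml = new XmlEmitterSketch();
        xml.element("mailbox", () -> {
            for (int i = 0; i < 2; i++) {
                xml.element("mail", () -> {
                    xml.element("from", "Ada Lovelace");
                    xml.element("to", "Hopper & Turing");
                });
            }
        });
        System.out.println(xml.result());
    }
}
```

Passing the children as a lambda means the open and close tags are always written as a pair, so nesting mistakes cannot produce unbalanced output.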

xmlgen vs xmlgen2

Scale factor 100
- A large data set
Apple MacBook Pro, M1 Max CPU, 64GB RAM
xmlgen (C89 / ANSI C)
- 221s generation time
- 11,758MB XML file
xmlgen2 (Java 21, VDS library)
- 420s generation time
- 13,235MB XML file
- Single threaded
  - Lots of room for performance improvements
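
Because the two tools produce files of different sizes, comparing throughput is a little fairer than comparing times alone. The arithmetic below uses only the figures on this slide and assumes the MB values are the sizes of the generated files; it gives roughly 53 MB/s for xmlgen and 31.5 MB/s for xmlgen2, with xmlgen2 taking about 1.9 times as long.

```java
class GenerationThroughputSketch {

    /** Output rate in MB per second. */
    static double megabytesPerSecond(double megabytes, double seconds) {
        return megabytes / seconds;
    }

    public static void main(String[] args) {
        // Figures from the slide: scale factor 100.
        double xmlgenMb = 11_758, xmlgenSeconds = 221;
        double xmlgen2Mb = 13_235, xmlgen2Seconds = 420;

        double xmlgenRate = megabytesPerSecond(xmlgenMb, xmlgenSeconds);
        double xmlgen2Rate = megabytesPerSecond(xmlgen2Mb, xmlgen2Seconds);

        System.out.printf("xmlgen:  %.1f MB/s%n", xmlgenRate);
        System.out.printf("xmlgen2: %.1f MB/s%n", xmlgen2Rate);
        System.out.printf("time ratio (xmlgen2 / xmlgen):       %.2f%n", xmlgen2Seconds / xmlgenSeconds);
        System.out.printf("throughput ratio (xmlgen2 / xmlgen): %.2f%n", xmlgen2Rate / xmlgenRate);
    }
}
```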

Next Steps

Round trip test
- Under NoSQLBench control (YAML script)
  - Generate XML
  - Load it into an XML:DB API database
  - Perform a set of simple benchmark queries
YAML Scripts for common benchmarks
- Reproducible
- Distributable
- Automatable

Questions?

NoSQLBench (our fork with XML:DB API driver):
https://github.com/evolvedbinary/nosqlbench
xmlgen2:
https://github.com/evolvedbinary/xmlgen2

Thank You
