Alan Paxton and Adam Retter

tech@evolvedbinary.com
 

XML Prague
@ University of Economics, Prague
2024-06-08

 

github.com/evolvedbinary

Modern Benchmarking of XQuery and XML Databases

Alan Paxton

alan@evolvedbinary.com
@alanpaxton
  • Senior Engineer @Evolved Binary

  • Databases, Concurrency, and Performance

  • Developer

    • Java

    • C++

    • XML and XQuery

  • Contributor to eXist-db since 2021

XML Benchmarking

  • What does it look like now?

    • Taxonomy

    • History

  • How does it compare?

    • To SQL Benchmarking

    • To (non-XML) NoSQL Benchmarking

  • What are we proposing?

    • A new "framework"

    • What can it do for us?

XML Benchmarking Taxonomy

  • XML for document transformation

    • XSLT

      • Efficient transformation in the large

  • XML for structured data storage

    • XML Document Stores

      • Documents stored unprocessed

    • XML Databases

      • Documents decomposed

  • XQuery

    • Query databases

    • Query documents

The Growth of XML

XML Benchmarking History

XOO7 (2001)

  • The Initial XML Benchmarking Effort

  • An XML version of the OO7 benchmark

    • OO7 is an OODBMS benchmark

  • XML as an alternative to SQL databases

  • Predates the XQuery standard

    • Different DBs had different query languages

    • Problems translating queries for each system

XMark - The XML Benchmark Project (2002)

  • An application-level benchmark

    • Characterising the broad performance of a whole system

    • Exercises as many features as possible

  • Scalable workload generation

    • xmlgen tool (C89)

    • Generate an auction database

      • Structured/linked elements

      • Text-heavy elements

      • Numbers of components scaled by benchmark size

  • Queries

    • Fixed set of queries

    • Allows consistent comparison of XML DBs/stores

Michigan Benchmark (2006)

  • A microbenchmark

    • Focused on isolating detailed aspects of performance

    • Intended for use as a developer tool

  • Compared 3 databases that support XML

    • Commercial XML DB

    • Commercial ORDBMS

    • Timber DB being developed at Michigan

      • Identified several performance issues for development

XQBench (2011)

  • A standardised environment for XML DB benchmarking

  • Aim for objective comparison of multiple systems

  • Configure preloaded workloads and queries

  • Invoke an experiment via Web API

    • A known workload and query set on a specific back end

    • E.g. run XMark at scale 10 on eXist-db

  • Record the results of all experiments

  • Extensible

    • New workloads, queries, experiments

XSLT Benchmarking

  • Extensible Stylesheet Language Transformations

    • Shares use cases with XQuery

      • XPath in common

    • Processing model

      • Take part(s) of input document(s)

      • Transform them

      • Create output document(s)

  • XT-Speedo (2014)

    • An XSLT benchmark

    • XSLT-focused workloads differ from XQuery workloads

      • Partly by user choice

      • Pragmatic collection of XSLT-focused documents, stylesheets

    • Stylesheet compilation is a factor

SQL Benchmark Frameworks

  • The Concept

    • Benchmark Framework or Testbed

  • OLTP-Bench (2013)

    • Testbed (their term) for SQL Databases

  • Goals

    1. Driving relational DBMSs via standard interfaces

    2. Tightly controlling the transaction mixture, request rate, and access distribution of the workload

    3. Automatically gathering a rich set of performance and resource statistics

SQL Testbed Requirements

  • Synthetic and Real Data & Workloads

  • Mixed and Evolving Workloads

    • Change the rate, composition, and access distribution of workloads dynamically

    • Simulate real-life events and evolving workloads

  • Fine-Grained Rate Control

    • Control request rates with precision

  • Flexible Workload Generation

    • Open, closed, and semi-open loop systems

  • Transactional Throughput Scalability

    • Without being restricted by clients

From OLTP-Bench
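
The difference between open and closed loop workload generation can be sketched with a small deterministic simulation. This is an illustration only: the class and method names (`LoopModels`, `openLoopStartsNanos`, `closedLoopStartsNanos`) are hypothetical and are not taken from OLTP-Bench. In the open loop case, request start times depend only on the target rate; in the closed loop case, a single client issues the next request only after the previous one has completed.

```java
import java.util.Arrays;

/** Illustrative sketch; names are hypothetical and not part of OLTP-Bench. */
final class LoopModels {

    /** Open loop: request i is due at i / rate, independent of earlier completions. */
    static long[] openLoopStartsNanos(int requests, double ratePerSecond) {
        long[] starts = new long[requests];
        double intervalNanos = 1_000_000_000.0 / ratePerSecond;
        for (int i = 0; i < requests; i++) {
            starts[i] = Math.round(i * intervalNanos);
        }
        return starts;
    }

    /** Closed loop, one client: each request starts when the previous one completes, plus think time. */
    static long[] closedLoopStartsNanos(long[] serviceNanos, long thinkNanos) {
        long[] starts = new long[serviceNanos.length];
        long clock = 0;
        for (int i = 0; i < serviceNanos.length; i++) {
            starts[i] = clock;
            clock += serviceNanos[i] + thinkNanos;
        }
        return starts;
    }

    public static void main(String[] args) {
        // 4 requests at 1000 requests/second
        System.out.println(Arrays.toString(openLoopStartsNanos(4, 1000.0)));
        // 3 requests with assumed service times of 2ms, 5ms and 1ms, no think time
        System.out.println(Arrays.toString(
            closedLoopStartsNanos(new long[] {2_000_000L, 5_000_000L, 1_000_000L}, 0L)));
    }
}
```

In the closed loop case a slow request delays every later request, so the offered load falls when the database slows down. The open loop schedule is unaffected by response times, which is why a testbed that wants precise rate control needs to support it.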

NoSQLBench (2020)

  • Testbed for NoSQL Database benchmarking

  • Pluggable architecture

    • Adapters for CQL, MongoDB, DynamoDB, etc.

  • Concurrent workload dispatch

  • Scriptable workloads (YAML)

    • Define operation templates

    • Define distribution of operation dispatch

  • Virtual Data Sets (VDS) for op field values

    • Rich set of functional generators

      • Value a function of the operation cycle number

      • Efficient generation and substitution into ops

      • Repeatable
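
The idea that a generated value is a pure function of the operation cycle number can be sketched with the JDK alone. This is a minimal sketch and not NoSQLBench code: `CycleFunctions`, `pick`, `join` and the name lists are hypothetical, and the mixing function is a SplitMix64-style finalizer used as a stand-in for whatever hashing a real library applies.

```java
import java.util.List;
import java.util.function.LongFunction;

/** Illustrative sketch; names and data are hypothetical, not the NoSQLBench API. */
final class CycleFunctions {

    /** SplitMix64-style finalizer, used here as a stand-in hash of the cycle number. */
    static long mix(long z) {
        z += 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    /** Picks one of the given values as a pure function of the cycle number. */
    static LongFunction<String> pick(List<String> values, long salt) {
        return cycle -> {
            long h = mix(cycle ^ salt);
            return values.get((int) Long.remainderUnsigned(h, values.size()));
        };
    }

    /** Joins two generators with a separator, both evaluated for the same cycle. */
    static LongFunction<String> join(LongFunction<String> left, String sep, LongFunction<String> right) {
        return cycle -> left.apply(cycle) + sep + right.apply(cycle);
    }

    /** Example composition with made-up sample data. */
    static LongFunction<String> fullNames() {
        LongFunction<String> first = pick(List.of("Ada", "Grace", "Alan"), 1L);
        LongFunction<String> last = pick(List.of("Lovelace", "Hopper", "Turing"), 2L);
        return join(first, " ", last);
    }

    public static void main(String[] args) {
        LongFunction<String> names = fullNames();
        for (long cycle = 0; cycle < 5; cycle++) {
            System.out.println(cycle + " -> " + names.apply(cycle));
        }
    }
}
```

Because no generator holds mutable state, running the same cycle range twice yields the same values, and cycle ranges can be split across threads without coordination.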

XQuery and XML Database Benchmarking with NoSQLBench

NoSQLBench - XML Extensions

  • XML:DB API Adapter

    • Drive XML Database(s) from NoSQLBench

      • YAML scripted XML tests

  • VDS-based XML Generation

    • Flexible synthetic-XML generation in Java 21

    • Reproduced XMark's xmlgen

      • Canonical auction site example

      • Scalable

    • About 50% slower than the C89 (ANSI C) version

      • Scope for improvement

Workload Configuration

scenarios:

  default:

    # single shot operation to clear / reset the database contents
    # elemental driver requires an endpoint parameter
    schema: >
      run driver=elemental
      tags==block:"schema.*"
      threads==1
      cycles==UNDEF
      endpoint=xmldb://localhost:3808


    # different ops in the "write" block occur on intervals, at the ratios declared by the ops
    # elemental driver requires an endpoint parameter
    write: >
      run driver=elemental
      tags==block:"write.*"
      cycles===TEMPLATE(write-cycles,TEMPLATE(docscount,1000))
      seq=interval
      threads=auto
      errors=timer,warn
      endpoint=xmldb://localhost:3808
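
One way to picture interval-style sequencing of ops by ratio is the sketch below. It is an assumption about how such a sequencer can work, not a copy of NoSQLBench's implementation: each op with ratio r is placed at positions 0, 1/r, 2/r, ... on a unit interval, and ops are then dispatched in position order, with ties broken by definition order. `IntervalSequencer` and the op names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of ratio-based interval sequencing; not the NoSQLBench implementation. */
final class IntervalSequencer {

    /** One scheduled occurrence of an op on the unit interval. */
    private record Slot(double position, int definitionOrder, String op) {}

    /** Expects an insertion-ordered map of op name to ratio, e.g. a LinkedHashMap. */
    static List<String> sequence(Map<String, Integer> ratios) {
        List<Slot> slots = new ArrayList<>();
        int order = 0;
        for (Map.Entry<String, Integer> e : ratios.entrySet()) {
            int ratio = e.getValue();
            for (int i = 0; i < ratio; i++) {
                slots.add(new Slot((double) i / ratio, order, e.getKey()));
            }
            order++;
        }
        // Sort by position on the interval, then by the order in which ops were defined
        slots.sort(Comparator.comparingDouble(Slot::position)
                             .thenComparingInt(Slot::definitionOrder));
        List<String> result = new ArrayList<>();
        for (Slot s : slots) {
            result.add(s.op());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> ratios = new LinkedHashMap<>();
        ratios.put("A", 4);
        ratios.put("B", 2);
        ratios.put("C", 1);
        System.out.println(String.join("", sequence(ratios))); // prints ABCAABA
    }
}
```

The sequence repeats every 7 cycles in this example, so over a long run the ops are dispatched in a 4:2:1 ratio while staying interleaved rather than batched.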

Configuration Bindings

# Use example built-in VDS functions

user_id: ToHashedUUID(); ToString() -> String

created_on: Uniform(1262304000,1577836800) -> long

gender: WeightedStrings('M:10;F:10;O:1')

city: Cities()


# We added our own function to select a paragraph of a Gutenberg book

text: WarAndPeace()


# Create a full name by joining first and last names

full_name: ListSized(FixedValue(2), FirstNames(), LastNames()); Join(' ')

XML:DB API Adapter op

# Insert a pseudo-random person record
op: |
  xquery version "3.1";

  ...

  let $record :=
      <person id="{user_id}" seq="{seq_key}">
        <name>{full_name}</name>
        <text>{alttext}</text>
      </person>
  return
    xmldb:store("/{collection}/testnb/beta", "{random_key}", $record)
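
How `{name}` placeholders in an op template might be filled from bindings for a given cycle can be sketched as follows. This is a simplified assumption, not the NoSQLBench template engine: `OpTemplate` and `render` are hypothetical, placeholders without a binding are left untouched, and values are inserted without XML escaping.

```java
import java.util.Map;
import java.util.function.LongFunction;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch of binding substitution; not the NoSQLBench template engine. */
final class OpTemplate {

    // Matches {identifier}; anything else between braces is left alone
    private static final Pattern PLACEHOLDER =
        Pattern.compile("\\{([A-Za-z_][A-Za-z0-9_]*)\\}");

    static String render(String template, Map<String, LongFunction<String>> bindings, long cycle) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            LongFunction<String> binding = bindings.get(m.group(1));
            // Unbound placeholders are copied through unchanged
            String replacement = (binding == null) ? m.group() : binding.apply(cycle);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, LongFunction<String>> bindings = Map.of(
            "user_id", c -> "u" + c,          // made-up binding for illustration
            "full_name", c -> "Name " + c);   // made-up binding for illustration
        String template = "<person id=\"{user_id}\"><name>{full_name}</name></person>";
        System.out.println(render(template, bindings, 7L));
    }
}
```

Since XQuery also uses braces for enclosed expressions, a substitution step like this has to happen before the query text reaches the database, and binding names need to be chosen so they cannot be confused with real XQuery expressions.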

XML Generation

  • Use the VDS part of NoSQLBench as a Java library

    • Declare bindings in code

  • Syntactic sugar around XML generation

    • Emit an element with value generated by the VDS function

Virtual Data Set Based Tool

LongFunction<String> fullNames = (LongFunction<String>) new Flow(
    new ListSized(
        new FixedValue(2),
        new FirstNames(),
        new LastNames()
    ), new Join<String>(" "));

LongFunction<String> education = new WeightedStrings(
    "Other: 3, High School: 10, College: 10, Graduate School: 4");
element("name", fullNames);

element("education", education);

NoSQLBench XML Generation

  • Compose new VDS functions

LongFunction<String> emailNames = new AppendList<>(new ListFunctions(
    new FirstNames(),
    new FixedValue(" "),
    new LastNames(),
    new FixedValue(" mailto:"),
    new LastNames(),
    new FixedValue("@"),
    new AlphaNumericString(4),
    new FixedValue("."),
    new WeightedStrings("com:10;org:8;co.uk:1")), "");

NoSQLBench XML Generation

  • Parameterise VDS functions from configuration

var openAuctions = configuration.element(Configuration.Elements.OpenAuctions);

this.auctionRef = new ListSizedStepped(
    new Zipf(10, 1.75),
    new HashRange(openAuctions.from(), openAuctions.to()));

NoSQLBench XML Generation

  • Create nested elements with lambdas

element("mailbox", this::mailbox);

protected void mailbox() {
  var num = next(emailCount);
  for (int i = 0; i < num; i++) {
    element("mail", () -> {
        element("from", emailNames);
        element("to", emailNames);
        element("date", Formatters.dateFmt.format(next(emailTime)));
        text.build();
    });
  }
}
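
The `element(...)` calls above suggest a small builder in which leaf elements take a value generator and nested elements take a lambda. The following is a minimal, self-contained sketch of that pattern under stated assumptions; it is not the tool's actual code. `XmlSketch`, its fields, and the way `next` advances the cycle counter are all hypothetical.

```java
import java.util.function.LongFunction;

/** Illustrative sketch of lambda-based XML emission; not the xmlgen2 implementation. */
final class XmlSketch {
    private final StringBuilder out = new StringBuilder();
    private long cycle;

    /** Evaluates a generator for the current cycle, then advances the cycle (an assumption of this sketch). */
    <T> T next(LongFunction<T> generator) {
        return generator.apply(cycle++);
    }

    /** Leaf element with literal text. */
    void element(String name, String text) {
        out.append('<').append(name).append('>')
           .append(escape(text))
           .append("</").append(name).append('>');
    }

    /** Leaf element whose text comes from a generator. */
    void element(String name, LongFunction<String> generator) {
        element(name, next(generator));
    }

    /** Container element: the lambda emits the children. */
    void element(String name, Runnable children) {
        out.append('<').append(name).append('>');
        children.run();
        out.append("</").append(name).append('>');
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    @Override
    public String toString() {
        return out.toString();
    }

    public static void main(String[] args) {
        XmlSketch xml = new XmlSketch();
        LongFunction<String> addresses = c -> "user" + c + "@example.org"; // made-up generator
        xml.element("mail", () -> {
            xml.element("from", addresses);
            xml.element("to", addresses);
        });
        System.out.println(xml);
    }
}
```

Overloading on `Runnable` and `LongFunction<String>` keeps call sites short: a zero-argument lambda or method reference selects the container form, while a generator selects the leaf form.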

xmlgen vs xmlgen2

  • Scale factor 100

    • A large data set

  • Apple MacBook Pro, M1 Max CPU, 64GB RAM

  • xmlgen (C89 / ANSI C)

    • 221s generation time

    • 11,758MB XML file

  • xmlgen2 (Java 21, VDS library)

    • 420s generation time

    • 13,235MB XML file

    • Single threaded

      • Lots of room for performance improvements
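
The figures above can be turned into comparable ratios with a little arithmetic. The sketch below uses only the numbers from this slide; the class and method names are hypothetical. Because the two tools produce files of different sizes, it reports both the elapsed-time ratio and the throughput in MB/s.

```java
/** Illustrative arithmetic on the slide's figures; names are hypothetical. */
final class XmlgenComparison {

    static double elapsedRatio(double slowerSeconds, double fasterSeconds) {
        return slowerSeconds / fasterSeconds;
    }

    static double throughputMBps(double megabytes, double seconds) {
        return megabytes / seconds;
    }

    public static void main(String[] args) {
        // xmlgen: 221s for 11,758MB; xmlgen2: 420s for 13,235MB (scale factor 100)
        System.out.printf("xmlgen2 / xmlgen elapsed time: %.2fx%n", elapsedRatio(420, 221));
        System.out.printf("xmlgen throughput:  %.1f MB/s%n", throughputMBps(11_758, 221));
        System.out.printf("xmlgen2 throughput: %.1f MB/s%n", throughputMBps(13_235, 420));
    }
}
```

This works out to roughly 1.9 times the elapsed time, and about 53 MB/s for xmlgen against about 32 MB/s for xmlgen2.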

Next Steps

  • Round trip test

    • Under NoSQLBench control (YAML script)

      • Generate XML

      • Load it into an XML:DB API database

      • Perform a set of simple benchmark queries

  • YAML Scripts for common benchmarks

    • Reproducible

    • Distributable

    • Automatable

Questions?

Thank You

Modern Benchmarking of XQuery and XML Databases

By Adam Retter


Presentation given for XML Prague - 8 June 2024 - University of Economics, Prague
