Modern Benchmarking of XQuery and XML Databases
Alan Paxton and Adam Retter
tech@evolvedbinary.com
github.com/evolvedbinary
XML Prague @ University of Economics, Prague
2024-06-08

Alan Paxton
alan@evolvedbinary.com
@alanpaxton
- Senior Engineer @ Evolved Binary
- Databases, Concurrency, and Performance
- Developer
  - Java
  - C++
  - XML and XQuery
- Contributor to eXist-db since 2021

XML Benchmarking
- What does it look like now?
  - Taxonomy
  - History
- How does it compare?
  - To SQL Benchmarking
  - To (non-XML) NoSQL Benchmarking?
- What are we proposing?
  - A new "framework"
  - What can it do for us?

XML Benchmarking Taxonomy
- XML for document transformation
  - XSLT
    - Efficient transformation in the large
- XML for structured data storage
  - XML Document Stores
    - Documents stored unprocessed
  - XML Databases
    - Documents decomposed
- XQuery
  - Query databases
  - Query documents

The Growth of XML

XML Benchmarking History

XOO7 (2001)
- The Initial XML Benchmarking Effort
- An XML version of the OO7 benchmark
  - OODBMS benchmark
- XML as an alternative to SQL databases
- Predates the XQuery standard
  - Different DBs had different query languages
  - Problems translating queries for each system

XMark - The XML Benchmark Project (2002)
- An application-level benchmark
  - Characterising the broad performance of a whole system
  - Exercises as many features as possible
- Scalable workload generation
  - xmlgen tool (C89)
  - Generate an auction database
    - Structured/linked elements
    - Text-heavy elements
    - Numbers of components scaled by benchmark size
- Queries
  - Fixed set of queries
  - Allows consistent comparison of XML DBs/stores
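Scaling the numbers of components by benchmark size can be sketched roughly as below. This is an illustration only, not xmlgen's code: the class name `ScaledCounts` and the base counts at scale factor 1.0 are assumptions made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: scale the number of generated components by a benchmark scale factor. */
final class ScaledCounts {

    // Illustrative base counts at scale factor 1.0 (assumed values, not taken from xmlgen).
    private static final Map<String, Integer> BASE = new LinkedHashMap<>();
    static {
        BASE.put("person", 25_000);
        BASE.put("item", 20_000);
        BASE.put("open_auction", 12_000);
        BASE.put("closed_auction", 10_000);
    }

    /** Returns the count of each component at the given scale factor, keeping at least one of each. */
    static Map<String, Long> countsFor(double scaleFactor) {
        if (scaleFactor <= 0) {
            throw new IllegalArgumentException("scale factor must be positive");
        }
        Map<String, Long> scaled = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : BASE.entrySet()) {
            scaled.put(e.getKey(), Math.max(1L, Math.round(e.getValue() * scaleFactor)));
        }
        return scaled;
    }

    public static void main(String[] args) {
        System.out.println(countsFor(0.01)); // a small development data set
        System.out.println(countsFor(10));   // a larger benchmark data set
    }
}
```

Because every component count derives from the one factor, the relative shape of the data set stays the same while its size changes.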
 
Michigan Benchmark (2006)
- A microbenchmark
  - Focused on isolating detailed aspects of performance
  - Targeted for use as a developer tool
- Compared 3 databases that support XML
  - Commercial XML DB
  - Commercial ORDBMS
  - Timber DB being developed at Michigan
    - Identified several performance issues for development

XQBench (2011)
- A standardised environment for XML DB benchmarking
- Aim for objective comparison of multiple systems
- Configure preloaded workloads and queries
- Invoke an experiment via Web API
  - A known workload and query set on a specific back end
  - E.g. run XMark at scale 10 on eXist-db
- Record results of experiments
- Extensible
  - New workloads, queries, experiments
- All experimental results recorded

XSLT Benchmarking
- Extensible Stylesheet Language Transformations
  - Shares use cases with XQuery
    - XPath in common
  - Processing model
    - Take part(s) of input document(s)
    - Transform them
    - Create output document(s)
- XT-Speedo (2014)
  - An XSLT benchmark
  - XSLT-focused workloads differ from XQuery workloads
    - Partly by user choice
    - Pragmatic collection of XSLT-focused documents, stylesheets
  - Stylesheet compilation is a factor
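To see why stylesheet compilation matters when timing XSLT, the sketch below separates compilation from transformation using the JDK's standard `javax.xml.transform` API. It is not how XT-Speedo measures anything; the class name `XsltTiming`, the stylesheet, and the input document are all invented for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical sketch: time stylesheet compilation separately from the transformation itself. */
final class XsltTiming {

    // A tiny example stylesheet (an assumption for this sketch): list each person's name.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"/people\">"
      + "<names><xsl:for-each select=\"person\"><n><xsl:value-of select=\"name\"/></n></xsl:for-each></names>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Compiles the stylesheet once; the resulting Templates object can be reused for many inputs. */
    static Templates compile(String xslt) {
        try {
            return TransformerFactory.newInstance().newTemplates(new StreamSource(new StringReader(xslt)));
        } catch (TransformerException e) {
            throw new IllegalStateException("stylesheet failed to compile", e);
        }
    }

    /** Runs one transformation with an already compiled stylesheet. */
    static String transform(Templates templates, String xml) {
        try {
            StringWriter out = new StringWriter();
            templates.newTransformer().transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String input = "<people><person><name>Ada</name></person><person><name>Alan</name></person></people>";
        long t0 = System.nanoTime();
        Templates templates = compile(STYLESHEET);
        long t1 = System.nanoTime();
        String result = transform(templates, input);
        long t2 = System.nanoTime();
        System.out.println(result);
        System.out.printf("compile: %d us, transform: %d us%n", (t1 - t0) / 1_000, (t2 - t1) / 1_000);
    }
}
```

A benchmark that reports only the combined time hides how much of it is one-off compilation, which matters for small inputs and for stylesheets reused across many documents.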
 
SQL Benchmark Frameworks
- The Concept
  - Benchmark Framework or Testbed
- OLTP-Bench (2013)
  - Testbed (their term) for SQL Databases
- Goals
  - Driving relational DBMSs via standard interfaces
  - Tightly controlling the transaction mixture, request rate, and access distribution of the workload
  - Automatically gathering a rich set of performance and resource statistics

SQL Testbed Requirements
- Synthetic and Real Data & Workloads
- Mixed and Evolving Workloads
  - Change the rate, composition, and access distribution of workloads dynamically
  - Simulate real-life events and evolving workloads
- Fine-Grained Rate Control
  - Control request rates with precision
- Flexible Workload Generation
  - Open, closed, and semi-open loop systems
- Transactional Throughput Scalability
  - Without being restricted by clients
From OLTP-Bench
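As a rough illustration of fine-grained rate control, the sketch below paces operations at a fixed target rate and measures each response time from the operation's intended start, so a slow operation does not quietly lower the offered rate. It is a single-threaded approximation of open-loop dispatch, not OLTP-Bench's implementation, and `OpenLoopPacer` and its methods are hypothetical names.

```java
import java.util.concurrent.locks.LockSupport;

/** Hypothetical sketch: fixed-rate pacing with latency measured from the intended start time. */
final class OpenLoopPacer {

    /** Offset in nanoseconds, from the start of the run, at which operation opIndex should begin. */
    static long intendedStartNanos(long opIndex, double opsPerSecond) {
        return Math.round(opIndex * (1_000_000_000.0 / opsPerSecond));
    }

    /**
     * Runs the operation `ops` times at the target rate and returns the worst response time
     * in nanoseconds, measured from each operation's intended (not actual) start.
     */
    static long run(int ops, double opsPerSecond, Runnable operation) {
        long origin = System.nanoTime();
        long worst = 0;
        for (int i = 0; i < ops; i++) {
            long intended = origin + intendedStartNanos(i, opsPerSecond);
            long wait;
            // Park until the intended start; if we are already late, start immediately.
            while ((wait = intended - System.nanoTime()) > 0) {
                LockSupport.parkNanos(wait);
            }
            operation.run();
            worst = Math.max(worst, System.nanoTime() - intended);
        }
        return worst;
    }

    public static void main(String[] args) {
        long worst = run(100, 1_000.0, () -> Math.sqrt(42.0)); // 100 trivial ops at 1000 ops/s
        System.out.println("worst response time (ns): " + worst);
    }
}
```

A closed-loop driver would instead issue the next operation only after the previous one completes, so its request rate falls whenever the system under test slows down.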
NoSQLBench (2020)
- Testbed for NoSQL Database benchmarking
- Pluggable architecture
  - Adapters for CQL, MongoDB, DynamoDB, etc.
- Concurrent workload dispatch
- Scriptable workloads (YAML)
  - Define operation templates
  - Define distribution of operation dispatch
- Virtual Data Sets (VDS) for op field values
  - Rich set of functional generators
    - Value a function of the operation cycle number
    - Efficient generation and substitution into ops
    - Repeatable
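The idea that a field value is a pure function of the operation cycle number, and is therefore repeatable, can be sketched in plain Java as follows. This is not NoSQLBench's implementation or API: `CycleBindings`, `pick`, `render`, and the template are hypothetical, and the mixing step is a SplitMix64-style finaliser chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongFunction;

/** Hypothetical sketch: derive op field values deterministically from the cycle number. */
final class CycleBindings {

    /** Mixes the cycle number into a well-spread 64-bit value (SplitMix64-style finaliser). */
    static long mix(long cycle) {
        long z = cycle + 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    /** A generator that picks one of the options; the same cycle always picks the same option. */
    static LongFunction<String> pick(String... options) {
        return cycle -> options[(int) Long.remainderUnsigned(mix(cycle), options.length)];
    }

    /** Substitutes each {name} placeholder with the value its binding produces for this cycle. */
    static String render(String template, Map<String, LongFunction<String>> bindings, long cycle) {
        String out = template;
        for (Map.Entry<String, LongFunction<String>> e : bindings.entrySet()) {
            out = out.replace("{" + e.getKey() + "}", e.getValue().apply(cycle));
        }
        return out;
    }

    /** Renders a small example op for one cycle, using two illustrative bindings. */
    static String example(long cycle) {
        Map<String, LongFunction<String>> bindings = new LinkedHashMap<>();
        bindings.put("seq", c -> Long.toString(c));
        bindings.put("city", pick("Prague", "London", "Berlin"));
        return render("<person seq=\"{seq}\"><city>{city}</city></person>", bindings, cycle);
    }

    public static void main(String[] args) {
        for (long cycle = 0; cycle < 3; cycle++) {
            System.out.println(example(cycle));
        }
    }
}
```

Because no state is carried between cycles, any thread can produce the op for any cycle, and a re-run generates exactly the same data.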
 
XQuery and XML Databases Benchmarking with NoSQLBench

NoSQLBench - XML Extensions
- XML:DB API Adapter
  - Drive XML Database(s) from NoSQLBench
    - YAML scripted XML tests
- VDS-based XML Generation
  - Flexible synthetic-XML generation in Java 21
  - Reproduced XMark's xmlgen
    - Canonical auction site example
    - Scalable
  - About 50% slower than the C89 (ANSI C) version
    - Scope for improvement

Workload Configuration
scenarios:
  default:
    # single shot operation to clear / reset the database contents
    # elemental driver requires an endpoint parameter
    schema: >
      run driver=elemental
      tags==block:"schema.*"
      threads==1
      cycles==UNDEF
      endpoint=xmldb://localhost:3808
    # different ops in the "write" schema occur on intervals, at the ratios declared by the ops
    # elemental driver requires an endpoint parameter
    write: >
      run driver=elemental
      tags==block:"write.*",
      cycles===TEMPLATE(write-cycles,TEMPLATE(docscount,1000))
      seq=interval
      threads=auto
      errors=timer,warn
      endpoint=xmldb://localhost:3808

Configuration Bindings
# Use example built-in VDS functions
user_id: ToHashedUUID(); ToString() -> String
created_on: Uniform(1262304000,1577836800) -> long
gender: WeightedStrings('M:10;F:10;O:1')
city: Cities()
# We added our own function to select a paragraph of a Gutenberg book
text: WarAndPeace()
# Create a full name by joining first and last names
full_name: ListSized(FixedValue(2), FirstNames(), LastNames()); Join(' ')

XML:DB API Adapter op
# Insert a pseudo-random person record
op: |
  xquery version "3.1";
  ...
  let $record :=
      <person id="{user_id}" seq="{seq_key}">
        <name>{full_name}</name>
        <text>{alttext}</text>
      </person>
  return
    xmldb:store("/{collection}/testnb/beta", "{random_key}", $record)

XML Generation
- Use the VDS part of NoSQLBench as a Java library
  - Declare bindings in code
- Syntactic sugar around XML generation
  - Emit an element with value generated by the VDS function

Virtual Data Set Based Tool
LongFunction<String> fullNames = (LongFunction<String>) new Flow(
    new ListSized(
        new FixedValue(2),
        new FirstNames(),
        new LastNames()
    ), new Join<String>(" "));
LongFunction<String> education = new WeightedStrings(
    "Other: 3, High School: 10, College: 10, Graduate School: 4");
element("name", fullNames);
element("education", education);

NoSQLBench XML Generation
- Compose new VDS functions
LongFunction<String> emailNames = new AppendList<>(new ListFunctions(
    new FirstNames(),
    new FixedValue(" "),
    new LastNames(),
    new FixedValue(" mailto:"),
    new LastNames(),
    new FixedValue("@"),
    new AlphaNumericString(4),
    new FixedValue("."),
    new WeightedStrings("com:10;org:8;co.uk:1")), "");

NoSQLBench XML Generation
- Parameterise VDS functions from configuration
var openAuctions = configuration.element(Configuration.Elements.OpenAuctions);
this.auctionRef = new ListSizedStepped(
    new Zipf(10, 1.75),
    new HashRange(openAuctions.from(), openAuctions.to()));
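For intuition about what a Zipf-distributed list size implies, the sketch below computes Zipf probabilities and maps a uniform value to a rank. It assumes the two arguments of `Zipf(10, 1.75)` above are the number of elements and the exponent; that reading, and the class `ZipfSketch`, are assumptions for the example rather than a description of the library's internals.

```java
/** Hypothetical sketch: Zipf probabilities P(k) proportional to 1 / k^exponent for ranks 1..n. */
final class ZipfSketch {

    /** Returns the normalised probability of each rank; index 0 holds rank 1. */
    static double[] probabilities(int n, double exponent) {
        double[] p = new double[n];
        double norm = 0;
        for (int k = 1; k <= n; k++) {
            p[k - 1] = 1.0 / Math.pow(k, exponent);
            norm += p[k - 1];
        }
        for (int k = 0; k < n; k++) {
            p[k] /= norm;
        }
        return p;
    }

    /** Maps a uniform value u in [0, 1) to a rank in 1..n by walking the cumulative distribution. */
    static int rankFor(double u, double[] probabilities) {
        double cumulative = 0;
        for (int k = 0; k < probabilities.length; k++) {
            cumulative += probabilities[k];
            if (u < cumulative) {
                return k + 1;
            }
        }
        return probabilities.length; // guards against rounding in the final sum
    }

    public static void main(String[] args) {
        double[] p = probabilities(10, 1.75);
        for (int k = 0; k < p.length; k++) {
            System.out.printf("rank %2d: %.4f%n", k + 1, p[k]);
        }
    }
}
```

With 10 elements and an exponent of 1.75, rank 1 takes a little over half of the probability mass, so under this reading most generated lists would be very short, with an occasional long one.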
NoSQLBench XML Generation
- Create nested elements with lambdas
element("mailbox", this::mailbox);

protected void mailbox() {
  var num = next(emailCount);
  for (int i = 0; i < num; i++) {
    element("mail", () -> {
      element("from", emailNames);
      element("to", emailNames);
      element("date", Formatters.dateFmt.format(next(emailTime)));
      text.build();
    });
  }
}
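One way such `element(...)` and `next(...)` helpers could be built is sketched below. It is a guess at the shape of the idea, not the actual xmlgen2 code: the class `XmlEmitter`, its cycle counter, and the escaping rules are assumptions for the example.

```java
import java.util.function.LongFunction;

/** Hypothetical sketch: element helpers that pull values from cycle-driven generator functions. */
final class XmlEmitter {
    private final StringBuilder out = new StringBuilder();
    private long cycle;

    XmlEmitter(long startCycle) {
        this.cycle = startCycle;
    }

    /** Draws the next value from a generator, advancing the cycle so successive calls differ. */
    <T> T next(LongFunction<T> generator) {
        return generator.apply(cycle++);
    }

    /** Emits a leaf element whose text comes from a generator function. */
    void element(String name, LongFunction<String> generator) {
        element(name, next(generator));
    }

    /** Emits a leaf element with literal (escaped) text. */
    void element(String name, String text) {
        out.append('<').append(name).append('>').append(escape(text)).append("</").append(name).append('>');
    }

    /** Emits a container element; the lambda emits its children between the start and end tags. */
    void element(String name, Runnable children) {
        out.append('<').append(name).append('>');
        children.run();
        out.append("</").append(name).append('>');
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    @Override
    public String toString() {
        return out.toString();
    }

    /** Builds one small nested example document fragment. */
    static String sample() {
        XmlEmitter xml = new XmlEmitter(0);
        LongFunction<String> names = c -> "user" + c;
        xml.element("mail", () -> {
            xml.element("from", names);
            xml.element("to", names);
            xml.element("subject", "Q&A <draft>");
        });
        return xml.toString();
    }

    public static void main(String[] args) {
        System.out.println(sample());
    }
}
```

Overloading `element` on `Runnable` is what lets the nesting of the Java code mirror the nesting of the XML it produces.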
xmlgen vs xmlgen2
- Scale factor 100
  - A large data set
- Apple MacBook Pro, M1 Max CPU, 64GB RAM
- xmlgen (C89 / ANSI C)
  - 221s generation time
  - 11,758MB XML file
- xmlgen2 (Java 21, VDS library)
  - 420s generation time
  - 13,235MB XML file
  - Single threaded
    - Lots of room for performance improvements
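Since the two generators produce files of different sizes, a throughput figure is a fairer comparison than elapsed time alone. The arithmetic below uses only the numbers on this slide; the class and method names are hypothetical.

```java
/** Worked arithmetic: compare the two generators by elapsed time and by MB generated per second. */
final class GenerationThroughput {

    static double mbPerSecond(double megabytes, double seconds) {
        return megabytes / seconds;
    }

    public static void main(String[] args) {
        double xmlgen = mbPerSecond(11_758, 221);   // C89 version
        double xmlgen2 = mbPerSecond(13_235, 420);  // Java 21 version

        System.out.printf("xmlgen:  %.1f MB/s%n", xmlgen);                    // 53.2 MB/s
        System.out.printf("xmlgen2: %.1f MB/s%n", xmlgen2);                   // 31.5 MB/s
        System.out.printf("elapsed time ratio: %.2fx%n", 420.0 / 221.0);      // 1.90x
        System.out.printf("throughput ratio:   %.2f%n", xmlgen2 / xmlgen);    // 0.59
    }
}
```

On these figures xmlgen2 takes about 1.9 times as long in wall-clock terms, and generates output at roughly 59% of xmlgen's rate once the larger file is taken into account.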
 
Next Steps
- Round trip test
  - Under NoSQLBench control (YAML script)
    - Generate XML
    - Load it into an XML:DB API database
    - Perform a set of simple benchmark queries
- YAML Scripts for common benchmarks
  - Reproducible
  - Distributable
  - Automatable
    
Questions?
- NoSQLBench (our fork with XML:DB API driver): https://github.com/evolvedbinary/nosqlbench
Thank You
Modern Benchmarking of XQuery and XML Databases
By Adam Retter
Presentation given for XML Prague - 8 June 2024 - University of Economics, Prague