![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/1461650/Screen_Shot_2015-06-05_at_13.12.39.png)
The Design and Implementation of
XML Prague
2019-02-08
Adam Retter
adam@evolvedbinary.com
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/4584337/twitter-logo.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/4584338/gmail-logo.png)
@adamretter
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/5756084/Screenshot_2019-02-07_at_07.43.22.png)
Why did we start in 2014?
-
Personal Concerns
-
Open Source NXDs problems/limitations are not being addresed
-
Commercial NXDs are Expensive and not Open Source
-
New NoSQL (JSON) document db are out-innovating us
-
10 Years invested in Open Source NXD, unhappy with progress
-
-
Commercial Concerns from Customers
-
Help! Our Open Source NXD sometimes:
-
Crashes and Corrupts the database
-
Stops responding
-
Won't Scale with new servers/users
-
-
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/1461650/Screen_Shot_2015-06-05_at_13.12.39.png)
-
Reported by Users
-
Stability - Responsiveness / Deadlocks
-
Operational Reliability - Backup / Corruption / Fail-over
-
Missing Feature - Key/Value Metadata for Documents
-
Missing Feature - Triple/Graph linking for inter/cross-document references
-
-
Recognised by Developers
-
Correctness - Crash Recovery / Deadlock Avoidance / Deadlock Resolution / ACID
-
Performance - Reducing Contention / Avoiding System Maintenance mutex
-
Missing Features - Multi Model / Clustering
-
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/1461650/Screen_Shot_2015-06-05_at_13.12.39.png)
OS NXD - Known Issues ~2014
Hold My Beer...
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/1461650/Screen_Shot_2015-06-05_at_13.12.39.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/5010877/IMG_20170426_163718.jpg)
I Gotta
Fix This!
-
Project Health?
-
Issues - Rate of decay, i.e. Open vs Closed over time
-
Attracts new contributors?
-
Attracts new and varied users?
-
How do contributors pay their bills?
-
-
Contributor Constraints?
-
How long to get PRs reviewed?
-
Open to radical changes? Incremental vs Big-bang?
-
Other developers with time/knowledge to review PRs
-
-
License
-
Business friendly? CLA?
-
-
Reputation - Perceived or otherwise
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/1461650/Screen_Shot_2015-06-05_at_13.12.39.png)
Can we fix an existing NXD?
-
Project "Granite"
-
Research and Development
-
Primary Focus on Correctness and Stability
-
Never become unresponsive
-
Never crash
-
Never lose data or corrupt the database
-
-
-
Should become Open Source
-
Should be appealing to Commercial enterprises
-
Open Source license choice(s) vs Revenue opportunities
-
-
Don't reinvent wheels!
-
Reuse - Faster time to market
-
Developers know eXist-db... Fork it!
-
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/1461650/Screen_Shot_2015-06-05_at_13.12.39.png)
Time to build something new
-
Why?
-
We don't trust it's correctness
-
Old and Creaky? - (dbXML ~2001)
-
Improved with caching and journaling
-
-
Not Scalable - single-threaded read/write
-
Classic B+ Tree
-
Why not fix it?
-
Newer/better algorthms exist - B-link Tree, Bw Tree, etc.
-
We want a giant-leap, not an incremental improvement
-
-
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/1461650/Screen_Shot_2015-06-05_at_13.12.39.png)
First... Replace eXist-db's Storage Engine
How does a NXD Store an XML Document Anyway?
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/1461650/Screen_Shot_2015-06-05_at_13.12.39.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/5744241/shred.jpg)
...Shredding!
1 2 3 4 5 6 7 8 9 10 |
<fruits> <fruit> <name>Apple</name> <colour>Green</colour> </fruit> <fruit> <name>Banana</name> <colour>Yellow</colour> </fruit> </fruits> |
Given some very simple XML
- fruits.xml
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/1461650/Screen_Shot_2015-06-05_at_13.12.39.png)
1. Number the tree (DLN)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/1461650/Screen_Shot_2015-06-05_at_13.12.39.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/5744325/node-numbering.png)
fruits.xml: docId=6
2. Extract Symbols
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/1461650/Screen_Shot_2015-06-05_at_13.12.39.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/5744763/node-symbols.png)
3. Store DOM
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/1461650/Screen_Shot_2015-06-05_at_13.12.39.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/5744954/dom-dbx.png)
4. Store Symbols
+ Collection Entry
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/1461650/Screen_Shot_2015-06-05_at_13.12.39.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/5744948/symbols-collections-dbx.png)
-
We won't develop our own!
-
It's hard (to get correct)!
-
Other well-resourced projects available for reuse
-
We would rather focus on the larger DBMS
-
-
Choosing a suitable engine
-
Other Java Database B+ Trees are unsuitable
-
Examined - Apache Derby, H2, HSQLDB, and Neo4j
-
-
Few Open Source pure Storage Engines in Java
-
Discounted MapDB - known issues
-
-
Identified 3 possibilities:
-
LMDB - B-Tree written in C
-
ForestDB - HB+-Trie written in C++11
-
RocksDB - LSM written in C++14
-
-
New Storage Engine
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/1461650/Screen_Shot_2015-06-05_at_13.12.39.png)
-
Fork of Google's LevelDB
-
Performance Improvements for concurrency and I/O
-
Optimised for SSDs
-
-
Large Open Source community with commercial interests
-
e.g. Facebook, AirBnb, LinkedIn, NetFlix, Uber, Yahoo, etc.
-
-
Rich feature set
-
LSM-tree (Log Structured Merge Tree)
-
MVCC (Multi-Version Concurrency Control)
-
ACID
-
Built-in Atomicity and Durability
-
Offers primitives for building Consistency and Isolation
-
-
Column Families
-
Why we Adopted RocksDB
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/1461650/Screen_Shot_2015-06-05_at_13.12.39.png)
How do we store XML into RocksDB?
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/1461650/Screen_Shot_2015-06-05_at_13.12.39.png)
...Mooor Shredding!
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/5745595/moor-shred.jpg)
-
We still:
-
Number the tree with DLN
-
Extract Symbols
-
Shred each node into a key/value pair
-
-
We do NOT use eXist's B+Tree or Variable Record Store
-
Instead, we use RocksDB Column Families
-
Each is an LSM-tree. Share a WAL
-
For each component
-
Persisent DOM Column Family - XML_DOM_STORE
-
Symbol Table Column Families - SYMBOL_STORE, NAME_INDEX, NAME_ID_INDEX
-
Collection Store Column Family - COLLECTION_STORE
-
-
Storing XML in RocksDB
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/1461650/Screen_Shot_2015-06-05_at_13.12.39.png)
RocksDB's LSM Tree
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/1461650/Screen_Shot_2015-06-05_at_13.12.39.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/5745932/lsm.png)
XML in RocksDB's SSTable File
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/1461650/Screen_Shot_2015-06-05_at_13.12.39.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/5746070/SST.png)
-
eXist-db lacks strong transaction semantics
-
Txn - Log commit/abort just for Crash Recovery
-
Mostly just the Durability of ACID
-
Isolation level: ~Read Uncommitted
-
-
Our Transactions
-
RocksDB ensures Atomicity and Durability
-
We add Consistency and Isolation
-
Begin Transaction creates a db Snapshot
-
Each Transaction has an in-memory Write Batch
-
Write - only to in-memory Write Batch
-
Read - try in-memory write batch, fallback to Snapshot
-
-
Isolation level: >= Snapshot Isolation
-
ACID Transactions
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/1461650/Screen_Shot_2015-06-05_at_13.12.39.png)
-
Each public API call is a transaction
-
e.g. REST / WebDAV / RESTXQ / XML-RPC / XML:DB
-
auto-abort on exception
-
auto-commit when the call returns data
-
-
Each XQuery is a transaction
-
auto-abort if the query raises an error
-
auto-commit when the query completes
-
Begin Transaction creates a db Snapshot
-
XQuery 3.0 try/catch
-
try - begins a new sub-transaction
-
catch - auto-abort, the operations in the try body are undone
-
Sub-transactions can be nested, just like try/catch
-
-
Transactions for Users
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/1461650/Screen_Shot_2015-06-05_at_13.12.39.png)
-
Key/Value Model
-
Metadata for Documents and Collections
-
Searchable from XQuery
-
-
Online Backup
-
Lock free
-
Checkpoint Backup
-
Full Document Export
-
-
UUID Locators
-
Persist across backups and nodes
-
-
BLOB Store
-
Deduplication
-
Other new features include...
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/1461650/Screen_Shot_2015-06-05_at_13.12.39.png)
-
Many Changes Upstreamed
-
eXist-db - Locking, BLOB Store, Concurrency, etc.
-
RocksDB - further Java APIs and improved JNI performance
-
-
With hindsight...
-
Wouldn't fork eXist-db...
-
Too much time spent discovering and fixing bugs
-
Start with a green field, add eXist-db compatible APIs
-
-
Much more work than anticipated
-
-
Today, new storage engines
-
FoundationDB / BadgerDB / FASTER
-
Reflections
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/1461650/Screen_Shot_2015-06-05_at_13.12.39.png)
-
Benchmarking and Performance
-
JSON Native
-
Virtualised Collections
-
Distributed Cluster
-
Graph Model
-
XQuery compilation to native
CPU/GPU code
On the roadmap...
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/1461650/Screen_Shot_2015-06-05_at_13.12.39.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/119594/images/5746251/building_shit.jpg)
The Design and Implementation of FusionDB
By Adam Retter
The Design and Implementation of FusionDB
Talk given at XML Prague 8 February 2019 - Prague
- 3,230