@adamretter / adam.retter@googlemail.com
Consultant
Software Engineer
Data(base) Geek
Last 2.5 Years with The National Archives (UK)
Building a new Digital Archive for the UK -> DRI
CSV Schema Language
CSV Validation Tool
Archive Records of UK from OGDs, NGOs and Special Interest
Excellent at traditional Paper records
One of the largest collections in the world
Over 11 million historical Government and Public Records
However, most records today are not created on paper!
Predicted 2013 - 2020:
>6PB of Digital Records to Archive
50% of which will be Born Digital
2009: Existing Digital Records System will not cope...
2011: Build new Digital Records Infrastructure
Records arrive via:
Hard Disks (USB etc)
DVD / CD / Digital Video Cassette / Tape (mostly LTO 1 to 6)
SFTP
Load Records
Test, Secure and Examine Records (Pre-Ingest)
Extract Metadata and Archive (Ingest)
Enable Digital Archivists (Search, Retrieval and Edit)
Export Transcoded Records and Metadata (Publish / Sell)
What constitutes a Record?
Given a disk of files - What do you Accession?
How does DRI know what it should process and how?
One or more Files and Metadata (Technical, Provenance, Transcription, Closure)
Records Selection Process by OGD, provided as metadata
Search source for metadata and process described records
TNA creates Metadata Standards for their records
Expects suppliers to provide Metadata alongside files (records)
CSV was decided upon as file format for metadata
XML and RDF were both considered
Must be achievable by non-technical staff
Often Gov IT Departments are outsourced
Installing even free applications is prohibitive (cost)
Likely familiarity with MS Excel (and available)
Past experience has shown that if the barriers to entry are too high, then suppliers will not comply
TNA has complex metadata requirements
Conditional Values and Co-variance Constraints
Relationships: row -> row, csv -> csv, csv -> files
Errors are introduced
Human
Transcription mistakes
Rename .xls
file to .csv
Computer
Poorly implemented Metadata generation
MS Excel can hide/mangle data e.g. #NAME?
Commercial - Suppliers try and cut corners
Version 0.1 (Internal Only)
Command Line tool developed in Java
Validated metadata across 3 types of CSV files
Validation rules were expressed in Java DSL
Home Guard Collection (Proof of Concept)
82,800 Records Checked
>250,000 rows of CSV data
~4.5TB of JP2000 Images validated
Still... many failures detected!
However, faster feedback (Pre-Ingest).
Eventually... shared with digitisation supplier
Version 0.1 was nice... but in Version 0.2 can we have:
Validation rules DSL should
Application(s) should be
Started at TNA as text based DSL for CSV Validation Rules
As interest grew... Requirements exploded!
Now:
A generic CSV Schema Language
60+ Expression for forming Validation Rules
10+ High-level data types (Dates, Times, Numbers etc.)
Flexible Support for any tabular text data (CSV, TSV, etc.)
Open Standard (Currently... guided by TNA)
Freely available under MPL v2.0
Simple Plain-Text Expression
Composable by non-techies with text editor
Implicit Context
Natural to write, rules are per-column, applied row-by-row
Sane Defaults
CSV files come in all shapes, e.g. default to RFC 4180.
Streamable
CSV files may be large. Do not prohibit efficient processing.
NOT a Programming Language!
Powerful? Yes! For programmers? No!
A CSV Schema consists of:
Directives - modify behaviour of CSV parsing and rules
Rules - 1 per column, composed of expressions
first_name,last_name,gender,dob
Adam,Retter,33,M,1981-02-04
Elisabeth,Roberts,33,F,1980-11-13
version 1.0
@totalColumns 4
first_name: length(2, *)
last_name: length(2, *)
gender: is("M") or is("F") @optional
dob: xDate
Global Directives control parsing of CSV
"Huxley"$"feline"$"Short Haired Domestic"$"10"
"Precious"$"feline"$"Short Haired Domestic"$"6"
"Mac"$"canine"$"Dalmatian"$"12"
version 1.0
@separator '$' @quoted @totalColumns 4 @noHeader
name: notEmpty
class: is("feline") or is("canine")
breed: length(3, 255)
age: positiveInteger
Conditional Expressions and Co-Variance
name,animal,age,short description,notes
James,Mouse,4,,
Louise,Elephant,45,In good health,
version 1.0
name: notEmpty
animal: notEmpty
age: if($animal/is("mouse"), range(0, 3), positiveInteger)
"short description": length(*, 255) @optional
notes:
External Expressions (mainly file checks)
"id","fn","checksum","classifications"
"1","image1.jp2","54229abfcfa5649e7003b83dd4755294",""
"2","image2.jp3",3d0ad5a7a8ef3b1d4e6ea33e92e4d3b5,""
"3","folder1/","",""
version 1.0
id: positiveInteger unique
fn: (ends(".jp2") or ends("/")) and unique
checksum: if($fn/ends("/"), empty, checksum(file($fn, "MD5")))
classifications: regex("[0-9a-z]+(,[0-9a-z]+)*") @optional
Validates CSV data against CSV Schema
Reference Implementation
Runs on any JVM v6+ (written in Scala 2.11)
Command Line Interface
GUI Application
Scala API
Java API
Open source, available under MPL v2.0
It's all open:
CSV Schema collaborators would be nice
Developers for CSV Validator
Bugfixes
New Features
CSV Schema
More data types, specifically numeric types
Expressions: any, min, max, foward/backward etc.
CSV Validator
Multi-Threading External Expressions
Stream error messages
Review regarding CSV on the Web WG products
Diana Newton, Peter Malewski, David Underdown, Alex Green, Nicola Welch, Richard Williams and Ian Ireland
Ben Parker, David Ainslie, Andy Hicks and Jim Collins