CSV Validation
Adam Retter
Evolved Binary Ltd
@adamretter / adam.retter@googlemail.com
It all started with...
Talking about
- CSV Schema Language
- CSV Validation Tool
Full talk: http://adamretter.org.uk/presentations.xml#csvconf14
The National Archives
- Archive Records of UK from OGDs, NGOs and Special Interest
- Excellent at traditional Paper records
  - One of the largest collections in the world
  - Over 11 million historical Government and Public Records
- However, most records today are not created on paper!
  - Predicted 2013 - 2020:
    - >6PB of Digital Records to Archive
    - 50% of which will be Born Digital
  - 2009: Existing Digital Records System will not cope...
    - 2011: Build new Digital Records Infrastructure = ME :-)
Digital Repository Infrastructure
- Records arrive via:
  - Hard Disks (USB etc)
  - DVD / CD / Digital Video Cassette / Tape (mostly LTO 1 to 6)
  - SFTP
- Load Records
- Test, Secure and Examine Records (Pre-Ingest)
- Extract Metadata and Archive (Ingest)
- Enable Digital Archivists (Search, Retrieval and Edit)
- Export Transcoded Records and Metadata (Publish / Sell)
Q: Digital Preservation...
- What constitutes a Record?
- Given a disk of files - What do you Accession?
- How does DRI know what it should process and how?
A: Metadata!
- One or more Files and Metadata (Technical, Provenance, Transcription, Closure)
- Records Selection Process by OGD, provided as metadata
- Search source for metadata and process described records
Collecting Metadata
- TNA creates Metadata Standards for their records
- Expects suppliers to provide Metadata alongside files (records)
- CSV was decided upon as file format for metadata
  - XML and RDF were both considered
  - Must be achievable by non-technical staff
    - Often Gov IT Departments are outsourced
    - Installing even free applications is prohibitive (cost)
    - Likely familiarity with MS Excel (and available)
- Past experience has shown that if the barriers to entry are too high, then suppliers will not comply
CSV Metadata Problems
- TNA has complex metadata requirements
  - Conditional Values and Co-variance Constraints
  - Relationships: row -> row, csv -> csv, csv -> files
- Errors are introduced
  - Human
    - Transcription mistakes
    - Rename .xls file to .csv
  - Computer
    - Poorly implemented Metadata generation
    - MS Excel can hide/mangle data e.g. #NAME?
  - Commercial - Suppliers try and cut corners
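Two of the error sources above are easy to detect mechanically. The following is a minimal sketch, not part of any TNA tool: the function names and the sample data are made up for illustration. It relies on two facts: legacy .xls files begin with the OLE2 compound-document signature, and Excel writes a fixed set of error tokens into cells whose formulas fail.

```python
import csv
import io

# Signature at the start of legacy .xls (OLE2 compound document) files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Error tokens that Excel writes into a cell when a formula fails.
EXCEL_ERRORS = {"#NAME?", "#VALUE!", "#REF!", "#DIV/0!", "#N/A", "#NUM!", "#NULL!"}

def looks_like_renamed_xls(first_bytes):
    """True if the file content starts like a binary .xls, whatever its extension."""
    return first_bytes.startswith(OLE2_MAGIC)

def find_excel_errors(csv_text):
    """Return (row_number, column_number, value) for each cell holding an Excel error token."""
    hits = []
    for row_no, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        for col_no, cell in enumerate(row, start=1):
            if cell.strip() in EXCEL_ERRORS:
                hits.append((row_no, col_no, cell.strip()))
    return hits

# Illustrative data only.
sample = "id,title\n1,#NAME?\n2,Some record title\n"
print(find_excel_errors(sample))  # [(2, 2, '#NAME?')]
```

Checks like these catch only the crudest problems; conditional values and cross-file relationships need a real rules language.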
 
CSV Validation
- Version 0.1 (Internal Only)
  - Command Line tool developed in Java
  - Validated metadata across 3 types of CSV files
  - Validation rules were expressed in Java DSL
  - Home Guard Collection (Proof of Concept)
    - 82,800 records checked (>250,000 rows of CSV data)
    - ~4.5TB of JPEG 2000 Images validated
- Still... many failures detected!
  - However, faster feedback (Pre-Ingest).
  - Eventually... shared with digitisation supplier
CSV Validation
- Version 0.1 was nice... but in Version 0.2 can we have:
  - Validation rules DSL should be
    - External (no need to recompile)
    - Writable by Domain Experts not Developers (no Java!)
    - Easily sharable with suppliers
  - Application(s) should be
    - Freely available to suppliers
    - Useable in DRI Pre-Ingest and Ingest processing
The CSV Schema Language
- Started at TNA as text based DSL for CSV Validation Rules
- As interest grew... Requirements exploded!
- Now:
  - A generic CSV Schema Language
  - 60+ Expressions for forming Validation Rules
  - 10+ High-level data types (Dates, Times, Numbers etc.)
  - Flexible Support for any tabular text data (CSV, TSV, etc.)
  - Open Standard (Currently... guided by TNA)
  - Freely available under MPL v2.0

Design Principles of CSV Schema
- Simple Plain-Text Expression
  - Composable by non-techies with text editor
- Implicit Context
  - Natural to write, rules are per-column, applied row-by-row
- Sane Defaults
  - CSV files come in all shapes, e.g. default to RFC 4180.
- Streamable
  - CSV files may be large. Do not prohibit efficient processing.
- NOT a Programming Language!
  - Powerful? Yes! For programmers? No!
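The "Implicit Context" and "Streamable" principles can be illustrated with a small sketch. This is not the CSV Validator's implementation: `validate_stream`, the rule functions and the sample data are hypothetical. It shows per-column rules applied row by row over an iterable of lines, so only one row needs to be in memory at a time and errors are reported as soon as they are found.

```python
import csv
import io

def validate_stream(lines, rules):
    """Lazily yield (row_number, column, value) for every cell that fails its rule.

    `lines` is any iterable of text lines (an open file works), so the whole
    CSV file is never loaded at once.
    """
    reader = csv.DictReader(lines)
    for row_no, row in enumerate(reader, start=2):  # row 1 is the header
        for column, rule in rules.items():
            value = row[column]
            if not rule(value):
                yield (row_no, column, value)

# One rule per column; each rule sees only the value of its own column.
rules = {
    "name": lambda v: v != "",
    "age": lambda v: v.isdigit() and int(v) > 0,
}

# Illustrative data only.
data = io.StringIO("name,age\nHuxley,10\n,6\nMac,-1\n")
errors = list(validate_stream(data, rules))
print(errors)  # [(3, 'name', ''), (4, 'age', '-1')]
```

Because `validate_stream` is a generator, a caller can stop at the first error or keep consuming errors while the file is still being read.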
 
CSV Schema 101
- A CSV Schema consists of:
  - Directives - modify behaviour of CSV parsing and rules
  - Rules - 1 per column, composed of expressions
CSV Data
first_name,last_name,gender,dob
Adam,Retter,M,1981-02-04
Elisabeth,Roberts,F,1980-11-13
CSV Schema
version 1.0
@totalColumns 4
first_name: length(2, *)
last_name: length(2, *)
gender: is("M") or is("F") @optional
dob: xDate
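To make the rule semantics concrete, here is a minimal Python sketch of how the four column rules could be evaluated. It is an illustration under stated assumptions, not how the CSV Validator works internally: every function name is hypothetical, `iso_date` is only a rough stand-in for `xDate`, and the sample rows are four-column data shaped like the header above, with one deliberately bad row.

```python
import csv
import io
import re
from datetime import date

def length(minimum, maximum=None):
    """length(min, max): None stands in for the schema's '*' (no upper bound)."""
    return lambda v: len(v) >= minimum and (maximum is None or len(v) <= maximum)

def is_(expected):
    return lambda v: v == expected

def or_(*rules):
    return lambda v: any(rule(v) for rule in rules)

def optional(rule):
    """@optional: an empty value passes, anything else must satisfy the rule."""
    return lambda v: v == "" or rule(v)

def iso_date(v):
    """Rough stand-in for xDate: a real calendar date written as YYYY-MM-DD."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", v):
        return False
    try:
        date.fromisoformat(v)
    except ValueError:
        return False
    return True

# One rule per column, in column order, as in the schema above.
SCHEMA = [
    ("first_name", length(2)),
    ("last_name", length(2)),
    ("gender", optional(or_(is_("M"), is_("F")))),
    ("dob", iso_date),
]

def validate(csv_text, schema=SCHEMA):
    """Return (row_number, column, value) for each failing cell."""
    errors = []
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    for row_no, row in enumerate(reader, start=2):
        if len(row) != len(schema):  # plays the role of @totalColumns
            errors.append((row_no, "@totalColumns", str(len(row))))
            continue
        for (column, rule), value in zip(schema, row):
            if not rule(value):
                errors.append((row_no, column, value))
    return errors

sample = (
    "first_name,last_name,gender,dob\n"
    "Adam,Retter,M,1981-02-04\n"
    "E,Roberts,X,1980-13-11\n"  # deliberately invalid row
)
print(validate(sample))
# [(3, 'first_name', 'E'), (3, 'gender', 'X'), (3, 'dob', '1980-13-11')]
```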
CSV Schema - Example 2
- Global Directives control parsing of CSV
CSV Data
"Huxley"$"feline"$"Short Haired Domestic"$"10"
"Precious"$"feline"$"Short Haired Domestic"$"6"
"Mac"$"canine"$"Dalmatian"$"12"
CSV Schema
version 1.0
@separator '$' @quoted @totalColumns 4 @noHeader
name: notEmpty
class: is("feline") or is("canine")
breed: length(3, 255)
age: positiveInteger
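As a rough analogy, the global directives above correspond to parser options. The sketch below maps them onto Python's standard `csv.reader`; the mapping, the `parse` function and its parameters are assumptions made for illustration. Note that `csv.reader` accepts quoted fields but does not insist on them, so `@quoted` is only approximated here.

```python
import csv
import io

# With @noHeader the data has no header row, so column names come from the rules.
COLUMN_NAMES = ["name", "class", "breed", "age"]

def parse(csv_text, separator="$", total_columns=4):
    """Parse '$'-separated, double-quoted data and enforce the column count."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=separator, quotechar='"')
    rows = []
    for row_no, row in enumerate(reader, start=1):
        if len(row) != total_columns:  # plays the role of @totalColumns 4
            raise ValueError(
                f"row {row_no}: expected {total_columns} columns, got {len(row)}"
            )
        rows.append(dict(zip(COLUMN_NAMES, row)))
    return rows

data = (
    '"Huxley"$"feline"$"Short Haired Domestic"$"10"\n'
    '"Mac"$"canine"$"Dalmatian"$"12"\n'
)
rows = parse(data)
print(rows[0]["name"], rows[1]["age"])  # Huxley 12
```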
CSV Schema - Example 3
- Conditional Expressions and Co-Variance
CSV Data
name,animal,age,short description,notes
James,Mouse,4,,
Louise,Elephant,45,In good health,
CSV Schema
version 1.0
name: notEmpty
animal: notEmpty
age: if($animal/is("mouse"), range(0, 3), positiveInteger)
"short description": length(*, 255) @optional
notes:
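The `age` rule depends on the value of another column in the same row. A minimal sketch of that behaviour follows. The function names are hypothetical, and three readings are assumptions rather than statements about the specification: `is()` is compared case-sensitively, `range(0, 3)` is inclusive at both ends, and `positiveInteger` means an unsigned run of digits.

```python
import re

def positive_integer(value):
    # Assumption: digits only, no sign.
    return re.fullmatch(r"[0-9]+", value) is not None

def in_range(low, high):
    """range(low, high), treated here as inclusive at both ends."""
    def check(value):
        try:
            return low <= float(value) <= high
        except ValueError:
            return False
    return check

def age_rule(row):
    """if($animal/is("mouse"), range(0, 3), positiveInteger)"""
    rule = in_range(0, 3) if row["animal"] == "mouse" else positive_integer
    return rule(row["age"])

rows = [
    {"name": "James", "animal": "mouse", "age": "4"},      # mouse older than 3: fails
    {"name": "James", "animal": "Mouse", "age": "4"},      # "Mouse" != "mouse": else branch
    {"name": "Louise", "animal": "Elephant", "age": "45"},
]
results = [age_rule(row) for row in rows]
print(results)  # [False, True, True]
```

Under the case-sensitive assumption, a row with `Mouse` (capital M) never reaches the `range(0, 3)` branch, which is worth keeping in mind when writing conditions against free-text columns.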
CSV Schema - Example 4
- External Expressions (mainly file checks)
CSV Data
"id","fn","checksum","classifications"
"1","image1.jp2","54229abfcfa5649e7003b83dd4755294",""
"2","image2.jp3",3d0ad5a7a8ef3b1d4e6ea33e92e4d3b5,""
"3","folder1/","",""
CSV Schema
version 1.0
id: positiveInteger unique
fn: (ends(".jp2") or ends("/")) and unique
checksum: if($fn/ends("/"), empty, checksum(file($fn), "MD5"))
classifications: regex("[0-9a-z]+(,[0-9a-z]+)*") @optional
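External expressions reach outside the CSV file, here to compare a stated MD5 checksum with the file on disk. The sketch below shows the idea using `hashlib`; `md5_of`, `check_row` and the `base_dir` parameter are hypothetical names, and the temporary file exists only to make the example self-contained.

```python
import hashlib
import os
import tempfile

def md5_of(path, chunk_size=65536):
    """Hash the file in chunks so large images are never read into memory whole."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check_row(row, base_dir):
    """Mirror the checksum rule: folders carry no checksum, files must match their MD5."""
    if row["fn"].endswith("/"):
        return row["checksum"] == ""
    path = os.path.join(base_dir, row["fn"])
    return os.path.isfile(path) and md5_of(path) == row["checksum"].lower()

with tempfile.TemporaryDirectory() as base_dir:
    # Create a stand-in file; its content is arbitrary example bytes.
    with open(os.path.join(base_dir, "image1.jp2"), "wb") as handle:
        handle.write(b"example bytes")

    good = {"fn": "image1.jp2", "checksum": hashlib.md5(b"example bytes").hexdigest()}
    bad = {"fn": "image1.jp2", "checksum": "0" * 32}
    folder = {"fn": "folder1/", "checksum": ""}
    results = [check_row(row, base_dir) for row in (good, bad, folder)]

print(results)  # [True, False, True]
```

File checks like this dominate validation time on large collections, since every referenced file has to be read in full.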
The CSV Validator
- Validates CSV data against CSV Schema
- Reference Implementation
- Runs on any JVM v6+ (written in Scala 2.11)
  - Command Line Interface
  - GUI Application
  - Scala API
  - Java API
  - Open source, available under MPL v2.0
- Fast and efficient! Battle-tested against large datasets.
Future Work
- It's all open:
  - CSV Schema collaborators would be nice
  - Developers for CSV Validator
  - Bugfixes
  - New Features
    - CSV Schema
      - More data types, specifically numeric types
      - Expressions: any, min, max, forward/backward etc.
    - CSV Validator
      - Multi-Threading External Expressions
      - Stream error messages
- Review regarding CSV on the Web WG products
CSV Validation Lightning Talk
By Adam Retter
Lightning Talk given at XML Summer School 2014