@adamretter / adam.retter@googlemail.com
Software Engineer
Mostly Scala and Java
Anything really...
Open Source Hacker
Consultant
Author
"eXist" for O'Reilly
W3C (Invited Expert)
XQuery
CSV on the Web, Provenance
Memory is finite
Stories (facts) become distorted over time
Point of truth
Historical Value
The National Archives
Preserving the nation's history
Governmental Records
Public Hearings
Special Collections (e.g. Records of LOCOG)
Archives Records of the UK from OGDs, NGOs and Special Interest groups
Excellent at traditional Paper records
One of the largest collections in the world
Over 11 million historical Government and Public Records
However, most records today are not created on paper!
Predicted 2013 - 2020:
>6PB of Digital Records to Archive
50% of which will be Born Digital
2009: Existing Digital Records System will not cope...
2011: Build new Digital Records Infrastructure = ME :-)
Records arrive via:
Hard Disks (USB etc)
DVD / CD / Digital Video Cassette / Tape (mostly LTO 1 to 6)
SFTP
Load Records
Test, Secure and Examine Records (Pre-Ingest)
Extract Metadata and Archive (Ingest)
Enable Digital Archivists (Search, Retrieval and Edit)
Export Transcoded Records and Metadata (Publish / Sell)
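The workflow above can be pictured as a small pipeline. A minimal Scala sketch follows, assuming a SHA-256 fixity (checksum) check as the pre-ingest example; the actual checks and component names of the real infrastructure are not given here, so everything below is illustrative:

```scala
import java.nio.file.{Files, Path, Paths}
import java.security.MessageDigest

object IngestPipelineSketch {

  // Pre-ingest: record a fixity (checksum) value so every later copy of the
  // record can be verified against it. SHA-256 is assumed purely for illustration.
  def fixity(record: Path): String = {
    val digest = MessageDigest.getInstance("SHA-256")
    digest.digest(Files.readAllBytes(record))
      .map(b => "%02x".format(b & 0xff))
      .mkString
  }

  // Ingest: extract some metadata alongside the fixity value.
  def extractMetadata(record: Path): Map[String, String] =
    Map("filename" -> record.getFileName.toString, "sha256" -> fixity(record))

  // Archive: placeholder standing in for storage, search, retrieval and export.
  def archive(record: Path, metadata: Map[String, String]): Unit =
    println(s"archived ${record.getFileName} with $metadata")

  def main(args: Array[String]): Unit =
    args.foreach { arg =>
      val record   = Paths.get(arg)
      val metadata = extractMetadata(record)
      archive(record, metadata)
    }
}
```

The point of the fixity value is that the same checksum can be recomputed years later, on a different copy, to show the bytes have not changed.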
PRONOM - File format database
http://apps.nationalarchives.gov.uk/pronom
DROID - File Identification Tool
http://digital-preservation.github.io/droid
CSV Schema and CSV Validator
http://digital-preservation.github.io/csv-schema
UTF-8 Validator (see decode sketch below)
https://github.com/digital-preservation/utf8-validator
Shadoop - Scala DSL for Hadoop
https://github.com/adamretter/shadoop
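The UTF-8 Validator in the list above checks that a transferred text file really is well-formed UTF-8 before it is accepted. As a sketch of the underlying idea only (this is not the tool's own API), the JDK's CharsetDecoder can be asked to report malformed input instead of quietly repairing it:

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}
import java.nio.file.{Files, Paths}

object Utf8CheckSketch {

  // Strict decode: any malformed byte sequence raises CharacterCodingException.
  def isValidUtf8(bytes: Array[Byte]): Boolean = {
    val decoder = StandardCharsets.UTF_8.newDecoder()
      .onMalformedInput(CodingErrorAction.REPORT)
      .onUnmappableCharacter(CodingErrorAction.REPORT)
    try {
      decoder.decode(ByteBuffer.wrap(bytes))
      true
    } catch {
      case _: CharacterCodingException => false
    }
  }

  def main(args: Array[String]): Unit =
    args.foreach { file =>
      val ok = isValidUtf8(Files.readAllBytes(Paths.get(file)))
      println(s"$file: ${if (ok) "valid" else "NOT valid"} UTF-8")
    }
}
```

REPORT is the important setting: the convenient `new String(bytes, "UTF-8")` route silently replaces bad sequences with U+FFFD, which is exactly what an archive must not do.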
"In library and archival science, digital preservation is a formal endeavor to ensure that digital information of continuing value remains accessible and usable. It involves planning, resource allocation, and application of preservation methods and technologies, and it combines policies, strategies and actions to ensure access to reformatted and "born-digital" content, regardless of the challenges of media failure and technological change."
"The goal of digital preservation is the accurate rendering of authenticated content over time."
- Taken from Wikipedia: https://en.wikipedia.org/wiki/Digital_preservation
File Identification and Analysis / Hardware Analysis (see the signature sketch after this list)
Emulation vs. Migration
Multiple copies on diverse media at multiple sites
Media Retention Policy - Frequently renewed and rewritten
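File identification (the first bullet above, and the job DROID does against the PRONOM registry) ultimately rests on matching known byte signatures. Here is a toy Scala version of that idea with a few well-known magic numbers hard-coded; real PRONOM/DROID signatures are far richer than a fixed prefix match:

```scala
import java.nio.file.{Files, Paths}

object IdentifySketch {

  // A few well-known magic numbers (format name -> leading bytes).
  // DROID uses the PRONOM signature files rather than a hard-coded table.
  val signatures: Seq[(String, Array[Byte])] = Seq(
    "PDF" -> "%PDF".getBytes("US-ASCII"),
    "PNG" -> Array(0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A).map(_.toByte),
    "ZIP" -> Array(0x50, 0x4B, 0x03, 0x04).map(_.toByte)
  )

  def identify(header: Array[Byte]): Option[String] =
    signatures.collectFirst {
      case (name, magic) if header.length >= magic.length &&
          header.take(magic.length).sameElements(magic) => name
    }

  def main(args: Array[String]): Unit =
    args.foreach { file =>
      val header = Files.readAllBytes(Paths.get(file)).take(16)
      println(s"$file: ${identify(header).getOrElse("unknown format")}")
    }
}
```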
Duh! The same reasons as archiving records.
Posterity
Personally: Mining!
The Internet Archive?
UK Government Web Archive?
Web Crawling :-(
Web Pages / File Downloads
Databases and Query End-points
REST
SPARQL
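Archiving a SPARQL endpoint means capturing query results rather than web pages. Below is a minimal sketch, assuming a placeholder endpoint URL (not a real government service), of issuing a SPARQL 1.1 Protocol query over HTTP GET with the Java 11 HttpClient:

```scala
import java.net.{URI, URLEncoder}
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.nio.charset.StandardCharsets

object SparqlFetchSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical endpoint; substitute the service you actually want to capture.
    val endpoint = "https://example.org/sparql"
    val query    = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

    // SPARQL 1.1 Protocol: a GET with the query in the 'query' parameter.
    val uri = URI.create(endpoint + "?query=" +
      URLEncoder.encode(query, StandardCharsets.UTF_8))

    val request = HttpRequest.newBuilder(uri)
      .header("Accept", "application/sparql-results+json")
      .GET()
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())

    // The result document, plus when and where it came from, is what gets archived.
    println(response.body())
  }
}
```

Unlike a page crawl, the query and the date it was run have to be recorded alongside the result, or the provenance is lost.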
Context
Provenance
e.g. CSV Data without headings and/or schema
Crawling RDF and SPARQL
Self-describing or described data
Some formats are better than others!
Human readable Schema?
Machine readable Schema?
Consider provenance
Even a timestamp in the data is very useful!
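To make the last two points concrete, here is a sketch that writes a tiny CSV with a header row plus simple provenance columns: the source it came from and an ISO-8601 timestamp of when it was retrieved. Column names and values are invented for illustration:

```scala
import java.nio.file.{Files, Paths}
import java.time.ZonedDateTime
import java.time.format.DateTimeFormatter
import scala.jdk.CollectionConverters._

object SelfDescribingCsvSketch {
  def main(args: Array[String]): Unit = {
    val retrievedAt = ZonedDateTime.now()
      .format(DateTimeFormatter.ISO_OFFSET_DATE_TIME)

    // Header row + provenance columns: a bare "1234" on its own tells a
    // future reader almost nothing about what it means or when it was true.
    val rows = Seq(
      "station,measurement,unit,source,retrieved_at",
      s"Kew,1234,mm,https://example.org/rainfall,$retrievedAt"
    )

    Files.write(Paths.get("rainfall.csv"), rows.asJava)
    println(rows.mkString("\n"))
  }
}
```

A machine-readable schema (for example one written for the CSV Schema language listed earlier) would go one step further and pin down the types and allowed values of each column.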
Is YOUR open data amenable to crawling?
How to archive without crawling?