Bitstore/Storage

From DDHFwiki

Rationale

The bottom layer of our bitstore is the Storage module.

The sole role of the storage module is to keep bits safe, until we ask for them again.

For this we have chosen the WARC file format, a really simple container format, standardized as ISO 28500.

WARC gives us the following features to work with:

  • Append-only file format

This is both convenient and incredibly important to preserve the integrity of an archive against software bugs. (The downside is that you need a separate index-cache to find anything fast.)

  • Compression.

Anybody can gzip(1) a file, but WARC specifies how to do it so that individual objects can still be pulled out of the middle, without having to decompress the entire file.
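The trick is that a WARC .gz file is a concatenation of gzip members, one per record, so a member can be extracted and decompressed on its own. A minimal sketch with three hypothetical records:

```python
import gzip
import io

# Three hypothetical records, each compressed as its own gzip member.
records = [b"record-one", b"record-two", b"record-three"]
offsets = []          # byte offset of each gzip member in the combined file
buf = io.BytesIO()
for rec in records:
    offsets.append(buf.tell())
    buf.write(gzip.compress(rec))
warc_gz = buf.getvalue()

# Pull out the middle record alone: slice out its member and decompress
# just that -- the rest of the file is never touched.
member = warc_gz[offsets[1]:offsets[2]]
assert gzip.decompress(member) == b"record-two"
```

This is also why the index-cache mentioned above must record byte offsets: random access needs to know where each member starts.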

  • Segmentation

In theory, there is no upper limit to the size of the bits we want to store, but for the preservation aspect, there are very good reasons for the result to end up in files of finite size which can fit onto archival media, such as DVDs or tapes. WARC allows us to split stored objects for preservation, and rejoin the parts for retrieval.
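Split-and-rejoin can be sketched as simple chunking; in real WARC files each chunk would be its own record carrying a WARC-Segment-Number header, and the segment size here is a toy value, not what archival media would use:

```python
# Hypothetical segment size; a real archive would pick something that
# fits its media (e.g. a DVD).
SEGMENT_SIZE = 4

payload = b"0123456789"

# Split for preservation: one chunk per WARC segment record.
segments = [payload[i:i + SEGMENT_SIZE]
            for i in range(0, len(payload), SEGMENT_SIZE)]
assert segments == [b"0123", b"4567", b"89"]

# Rejoin for retrieval: concatenating the segments in order restores
# the original object.
assert b"".join(segments) == payload
```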

Interface/API

Current thinking is to dedicate a jail to the bitstore/storage, offering a simple HTTP-based service, with access restricted to blessed clients.

The three basic verbs would be:

   POST /add
      Returns the WARC-Record-ID of the new object as an HTTP response header.
      Adding a duplicate is a no-op, but still returns success, with an indication that the object already existed.
   
   {GET|HEAD} /i/<ID>
      Returns the bits in the body and the headers as HTTP-headers.
   
   {GET|HEAD} /next
      Returns the next object in storage, as if a {GET|HEAD} /i/<ID> had been done on its ID.
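The semantics of the three verbs can be modelled in a few lines. This is a toy in-memory sketch, not the HTTP service itself; the ID scheme is the one described under WARC-Record-IDs below, and how /next identifies its cursor is an assumption, since the page does not spell it out:

```python
import hashlib

class BitStore:
    """Toy model of the three verbs; the real service is HTTP-based."""

    def __init__(self):
        self.objects = {}        # WARC-Record-ID -> bits
        self.order = []          # insertion order, for /next

    def add(self, bits):
        # POST /add: returns the WARC-Record-ID; duplicates are a no-op
        # but still report success.
        rid = "http://bits.datamuseum.dk/i/" + hashlib.sha256(bits).hexdigest()
        duplicate = rid in self.objects
        if not duplicate:
            self.objects[rid] = bits
            self.order.append(rid)
        return rid, duplicate

    def get(self, rid):
        # GET /i/<ID>: returns the stored bits.
        return self.objects[rid]

    def next(self, rid):
        # GET /next: the object stored after <ID> (hypothetical cursor
        # handling -- the page does not specify how <ID> is passed).
        return self.order[self.order.index(rid) + 1]

store = BitStore()
rid1, dup1 = store.add(b"hello")
rid2, dup2 = store.add(b"world")
assert not dup1
assert store.add(b"hello") == (rid1, True)   # duplicate: no change, flagged
assert store.get(rid1) == b"hello"
assert store.next(rid1) == rid2
```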

This API is not meant for direct human/browser consumption, but rather as backend for other applications.

However, the API can be used as a direct provider of the stored objects; for instance, images in web pages can link directly to the WARC-Record-ID.

WARC-Record-IDs

We will use:

   http://bits.datamuseum.dk/i/sha256_hexdumped

NB: This is a pretty definitive decision, changing it subsequently will require rewriting/regenerating pretty much everything, everywhere.

WARC-Record-IDs should be strongly linked to the stored object, and independent of everything else, including time and space and any future perception or metadata for this object.

The easiest way to create a unique identifier is to use a strong hash over the content itself, and this saves time since we already compute one for the WARC-Payload-Digest header.
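Deriving the ID is then a one-liner over the payload, with nothing else mixed in, so the identifier is independent of time, place and metadata:

```python
import hashlib

payload = b"example object"            # hypothetical payload
digest = hashlib.sha256(payload).hexdigest()   # lower-case hex, 64 chars
record_id = "http://bits.datamuseum.dk/i/" + digest

# The digest is already lower-case hex of fixed length, and the same
# bits always produce the same ID.
assert len(digest) == 64 and digest == digest.lower()
assert record_id == ("http://bits.datamuseum.dk/i/"
                     + hashlib.sha256(b"example object").hexdigest())
```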

WARC Filenames

Our filenames will take the form "DDHF-%08d.warc.gz" where the '%08d' gets replaced by a serial number [1...]

The filenames are only visible internally to the implementation of the bit-store.
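The filename pattern above is a plain printf-style format; for example:

```python
# Filename generation from the serial number, per the
# "DDHF-%08d.warc.gz" pattern described above.
def warc_filename(serial):
    return "DDHF-%08d.warc.gz" % serial

assert warc_filename(1) == "DDHF-00000001.warc.gz"
assert warc_filename(42) == "DDHF-00000042.warc.gz"
```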

Block/Payload digests

We will use SHA256 lower-case hexdumped, since this is conveniently available to users with the sha256(1) program and SHA256(3) API.

Data structure & Sanity

The "POST /add" processing will include sanity checks before adding records to the WARC file.

All objects must have "Content-Length:", "Content-Type:", "WARC-Type:" and "WARC-Payload-Digest:" headers.

The WARC-Payload-Digest must match the provided payload.

WARC-Type can take two values, and the value determines further sanity checks as follows:

WARC-Type: resource

The "Content-Type:" header must be found on the white-list.

WARC-Type: metadata

Must have "Content-Type: text/xml" (with DCMI metadata, but we do not check that here).

Must have a "WARC-Refers-To:" header, which points to an existing resource record.
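The checks above can be sketched as one validation function. The header names come from this page; the white-list contents and the "sha256:&lt;hex&gt;" digest spelling are assumptions for illustration:

```python
import hashlib

# Assumed white-list contents -- the real list is project policy.
CONTENT_TYPE_WHITELIST = {"application/octet-stream", "image/png"}

def sanity_check(headers, payload, known_resources):
    """Return None if the record passes, else a reason string."""
    for h in ("Content-Length", "Content-Type",
              "WARC-Type", "WARC-Payload-Digest"):
        if h not in headers:
            return "missing header: " + h
    # Assumed digest spelling: "sha256:" + lower-case hexdigest.
    if headers["WARC-Payload-Digest"] != \
            "sha256:" + hashlib.sha256(payload).hexdigest():
        return "digest does not match payload"
    wtype = headers["WARC-Type"]
    if wtype == "resource":
        if headers["Content-Type"] not in CONTENT_TYPE_WHITELIST:
            return "Content-Type not on white-list"
    elif wtype == "metadata":
        if headers["Content-Type"] != "text/xml":
            return "metadata must be text/xml"
        if headers.get("WARC-Refers-To") not in known_resources:
            return "WARC-Refers-To must name an existing resource"
    else:
        return "WARC-Type must be resource or metadata"
    return None

payload = b"bits"
headers = {
    "Content-Length": str(len(payload)),
    "Content-Type": "application/octet-stream",
    "WARC-Type": "resource",
    "WARC-Payload-Digest":
        "sha256:" + hashlib.sha256(payload).hexdigest(),
}
assert sanity_check(headers, payload, set()) is None
```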

Implementation details

The implementation will be a C program, for robustness.

POSTs will be serialized, to avoid races around the end/start of WARC files and for general sanity preservation.

GET/HEAD can be multithreaded.

The SHA256-to-WARC-file-coordinates index-cache will be stored in an sqlite3 database.
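A minimal sketch of what that index-cache could look like; the schema is an assumption, not the project's actual one:

```python
import sqlite3

# Assumed schema: digest -> (file, offset, length) of the gzip member.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE index_cache (
    digest   TEXT PRIMARY KEY,   -- sha256 hexdigest of the payload
    filename TEXT NOT NULL,      -- e.g. DDHF-00000001.warc.gz
    offset   INTEGER NOT NULL,   -- byte offset of the gzip member
    length   INTEGER NOT NULL    -- compressed length of the member
)""")

db.execute("INSERT INTO index_cache VALUES (?, ?, ?, ?)",
           ("ab" * 32, "DDHF-00000001.warc.gz", 0, 512))

# Lookup: one indexed read turns a WARC-Record-ID into file coordinates.
row = db.execute(
    "SELECT filename, offset FROM index_cache WHERE digest = ?",
    ("ab" * 32,)).fetchone()
assert row == ("DDHF-00000001.warc.gz", 0)
```

Since the WARC files are append-only, the cache can always be rebuilt from scratch by scanning them, which is what makes it a cache rather than primary data.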

A separate test-storage will be run on a different TCP-port, to not pollute the "official" storage with test records.

Cron-jobs will be used to test the integrity of the WARC files on a regular basis.

Caching will not be performed at this level.
