Data model
==========

It is important to understand the overall data model if you want
to work in, and to a lesser degree with, the Autoarchaeologist.

Unless otherwise noted, all these are python classes, which can
be freely subclassed.

Excavation
----------

An excavation is the top level data structure which turns a number
of input files into a number of HTML files.

* Drives the process of examination
* Drives the production of HTML output files
* Provide default interpretation of unloved artifacts
* Can decorate the HTML output files
* Provides the default values and parameters
* Creates the artifacts
* Provides various services to everything else

Relationships:

* Holds the artifacts (top- and excavated-)
* Holds the examiners
* Holds the default typecase
* Offers each artifact to each examiner once.

Artifact
--------

An artifact is a collection of one or more octets.

* Has a single unique sequential ordering of its octets
* Has a unique identifier (SHA256 of the unique sequential ordering)
* May be a concatenation of records
* May be a top-level artifact (= The files provided as input)

Relationships:

* May have notes
* May have types
* May have namespaces
* May have interpretations
* May have child artifacts
* May have a typecase (otherwise it will inherit one)

Record
------

A record is a fragment of an artifact.

* Has a key which is either None or a tuple
* Contains at least one octet

Relationships:

* Keys do not control the order of records in the artifact
* By convention disk-like artifacts use (cyl, head, sect) keys.
* By convention tape-like artifacts use (file, block) keys.

Examiner
--------

Tries to understand and interpret (some of) the artifacts, with
increasingly expensive checks to determine if it is an artifact
it can analyse, and then a full analysis.

Relationships:

* Is offered each artifact exactly once
* Can attach notes to artifacts
* Can attach types to artifacts
* Can attach namespaces to artifacts
* Can attach HTML or Unicode interpretations to artifacts
* Can, but normally should not, call other Examiners directly

Typecase
--------

Translates a character set into Unicode.

* Can translate a sequence of integers to short or long UTF-8 representation

TODO: Modal character sets like BAUDOT (telex, teleprinter)

Relationships:

* Is basically an array of Glyphs

Glyphs
------

Holds the information about one code point in a character set

* Has a "short" Unicode single-position, representation suitable for (hex)dumps
* Has a "long" Unicode representation which is "does the right thing" (for instance newlines)
* Can have flags (INVALID, IGNORE, EOF and others) which affect interpretation

Containers
----------

Some input artifacts will be raw bytes, for instance a hard disk
image, others will be container formats such as ``.IMD``, ``SIMH-TAP``
and ``ZIP``.

If we wanted to, we could ingest these containers, examine them
with ``OctetView`` and create the artifacts they contain for further
examination.

But our audience is not here to see how ``IMD`` files are constructed,
so it makes more sense to not instantiate the container files directly,
but only what they contain.

This is what ``containers`` do, and some of them even do it with the
``OctetView``, they just dont create an HTML interpretation.

Collections
-----------

AA will be used on artifacts from various public collections, Datamuseum.DK's
own BitArchive, Al Kossow & CHM's bitsavers.org and so on.

The collection classes makes it easier to pull in artifacts from
such well known collections, and cache the downloaded artifacts
to reduce traffic.