Data model¶
It is important to understand the overall data model if you want to work in, and to a lesser degree with, the Autoarchaeologist.
Unless otherwise noted, all of these are Python classes, which can be freely subclassed.
Excavation¶
An excavation is the top-level data structure, which turns a number of input files into a number of HTML files.
Drives the process of examination
Drives the production of HTML output files
Provides a default interpretation of unloved artifacts
Can decorate the HTML output files
Provides the default values and parameters
Creates the artifacts
Provides various services to everything else
Relationships:
Holds the artifacts (top- and excavated-)
Holds the examiners
Holds the default typecase
Offers each artifact to each examiner once.
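The "offers each artifact to each examiner once" relationship can be illustrated with a minimal sketch. The names `ToyArtifact`, `ToyExcavation`, `add_artifact`, `examine` and `split_on_nul` are hypothetical and chosen for this illustration only; they are not the Autoarchaeologist's actual API. The sketch assumes that examiners are plain callables and that artifacts excavated during examination are queued for examination in turn.

```python
import hashlib


class ToyArtifact:
    """Hypothetical stand-in for an artifact: octets plus attached notes."""

    def __init__(self, octets):
        self.octets = bytes(octets)
        self.digest = hashlib.sha256(self.octets).hexdigest()
        self.notes = set()


class ToyExcavation:
    """Hypothetical stand-in for an excavation; not the real class."""

    def __init__(self):
        self.artifacts = {}   # digest -> artifact, in insertion order
        self.examiners = []   # callables taking (excavation, artifact)

    def add_artifact(self, octets):
        """Create an artifact, or return the existing one with the same octets."""
        candidate = ToyArtifact(octets)
        return self.artifacts.setdefault(candidate.digest, candidate)

    def examine(self):
        """Offer each artifact to each examiner exactly once."""
        done = set()
        while True:
            # Examiners may add artifacts, so keep going until none are pending.
            pending = [a for a in self.artifacts.values() if a.digest not in done]
            if not pending:
                break
            for artifact in pending:
                done.add(artifact.digest)
                for examiner in self.examiners:
                    examiner(self, artifact)


def split_on_nul(excavation, artifact):
    """Toy examiner: excavates the parts of a NUL-separated artifact."""
    parts = [p for p in artifact.octets.split(b"\x00") if p]
    if len(parts) > 1:
        artifact.notes.add("NUL-separated")
        for part in parts:
            excavation.add_artifact(part)


exc = ToyExcavation()
exc.examiners.append(split_on_nul)
exc.add_artifact(b"hello\x00world")
exc.examine()
print(len(exc.artifacts))   # 3: the input plus two excavated artifacts
```

Keying the artifacts by digest means that identical octets found in two places end up as a single artifact, which is one plausible reason for using a content hash as the identifier.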
Artifact¶
An artifact is a collection of one or more octets.
Has a single unique sequential ordering of its octets
Has a unique identifier (SHA256 of the unique sequential ordering)
May be a concatenation of records
May be a top-level artifact (= The files provided as input)
Relationships:
May have notes
May have types
May have namespaces
May have interpretations
May have child artifacts
May have a typecase (otherwise it will inherit one)
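As an illustration of the identifier and the typecase inheritance, here is a small sketch. `SketchArtifact` and its attributes are hypothetical names, and the sketch assumes a single parent per artifact and an inheritance that walks up the parent chain; the real classes may resolve the typecase differently, for instance by falling back to the default typecase held by the excavation.

```python
import hashlib


class SketchArtifact:
    """Hypothetical artifact: immutable octets plus optional attachments."""

    def __init__(self, octets, parent=None, typecase=None):
        self.octets = bytes(octets)
        # Identity depends only on the octets, in their one sequential order.
        self.digest = hashlib.sha256(self.octets).hexdigest()
        self.parent = parent
        self.children = []
        self.notes = set()
        self.own_typecase = typecase
        if parent is not None:
            parent.children.append(self)

    @property
    def typecase(self):
        """Return this artifact's typecase, or inherit one from an ancestor."""
        node = self
        while node is not None:
            if node.own_typecase is not None:
                return node.own_typecase
            node = node.parent
        return None   # assumption: the excavation would supply a default here


top = SketchArtifact(b"\x01\x02\x03\x04", typecase="ascii-like")
child = SketchArtifact(top.octets[:2], parent=top)
twin = SketchArtifact(b"\x01\x02")

print(child.typecase)               # inherited from the top-level artifact
print(child.digest == twin.digest)  # same octets, same identifier
```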
Record¶
A record is a fragment of an artifact.
Has a key which is either None or a tuple
Contains at least one octet
Relationships:
Keys do not control the order of records in the artifact
By convention disk-like artifacts use (cyl, head, sect) keys.
By convention tape-like artifacts use (file, block) keys.
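The point that keys do not control the order of records can be shown with a short sketch. `SketchRecordArtifact`, `add_record`, `octets` and `get` are hypothetical names used only for this illustration, under the assumption that the artifact's octets are the concatenation of its records in the order they were added.

```python
import hashlib


class SketchRecordArtifact:
    """Hypothetical artifact assembled from keyed records."""

    def __init__(self):
        self.records = []   # (key, octets) pairs, in artifact order

    def add_record(self, key, octets):
        if key is not None and not isinstance(key, tuple):
            raise TypeError("key must be None or a tuple")
        if len(octets) == 0:
            raise ValueError("a record contains at least one octet")
        self.records.append((key, bytes(octets)))

    def octets(self):
        """Concatenate records in the order they were added, not by key."""
        return b"".join(data for _key, data in self.records)

    def digest(self):
        return hashlib.sha256(self.octets()).hexdigest()

    def get(self, key):
        """Look up a record by key, e.g. (cyl, head, sect)."""
        for rec_key, data in self.records:
            if rec_key == key:
                return data
        raise KeyError(key)


disk = SketchRecordArtifact()
# Interleaved sectors: added in physical order, keyed by (cyl, head, sect).
disk.add_record((0, 0, 2), b"BB")
disk.add_record((0, 0, 1), b"AA")

print(disk.octets())         # b'BBAA'
print(disk.get((0, 0, 1)))   # b'AA'
```

Sorting by key would have produced `b'AABB'`, a different sequence of octets and therefore a different SHA256 identifier.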
Examiner¶
An examiner tries to understand and interpret (some of) the artifacts: it runs increasingly expensive checks to determine whether an artifact is one it can analyse, and only then performs a full analysis.
Relationships:
Is offered each artifact exactly once
Can attach notes to artifacts
Can attach types to artifacts
Can attach namespaces to artifacts
Can attach HTML or Unicode interpretations to artifacts
Can, but normally should not, call other Examiners directly
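The "increasingly expensive checks, then a full analysis" pattern might look like the following sketch. The file format (magic `LP`, one length octet, payload) is made up for the example, and `SimpleArtifact` and `examine_length_prefixed` are hypothetical names rather than part of the real code base.

```python
class SimpleArtifact:
    """Hypothetical minimal artifact with the attachment points an examiner uses."""

    def __init__(self, octets):
        self.octets = bytes(octets)
        self.notes = set()
        self.types = set()
        self.interpretations = []


def examine_length_prefixed(artifact):
    """Toy examiner for a made-up format: b'LP', a length octet, then payload.

    Returns True if the artifact was recognised and analysed.
    """
    data = artifact.octets
    # Cheapest check first: is it even long enough?
    if len(data) < 3:
        return False
    # Slightly more expensive: the magic number.
    if data[:2] != b"LP":
        return False
    # More expensive: is the structure self-consistent?
    if data[2] != len(data) - 3:
        return False
    # Only now do the full analysis and attach the results.
    payload = data[3:]
    artifact.types.add("LengthPrefixed")
    artifact.notes.add("LP")
    artifact.interpretations.append("payload: " + payload.hex())
    return True


good = SimpleArtifact(b"LP\x02\xca\xfe")
bad = SimpleArtifact(b"LP\x09\xca\xfe")   # length octet does not match

print(examine_length_prefixed(good), good.interpretations)
print(examine_length_prefixed(bad), bad.types)
```

Rejecting early keeps the cost low when an examiner is offered many artifacts it cannot handle, which is the common case.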
Typecase¶
Translates a character set into Unicode.
Can translate a sequence of integers to a short or long UTF-8 representation
TODO: Modal character sets like BAUDOT (telex, teleprinter)
Relationships:
Is basically an array of Glyphs
Glyphs¶
Holds the information about one code point in a character set
Has a “short”, single-position Unicode representation suitable for (hex)dumps
Has a “long” Unicode representation which “does the right thing” (for instance newlines)
Can have flags (INVALID, IGNORE, EOF and others) which affect interpretation
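A typecase as "an array of glyphs" can be sketched as follows. `Glyph`, `SketchTypecase`, `set`, `decode_short` and `decode_long` are hypothetical names, and the way the flags are applied here (IGNORE skips the code point, EOF stops the long translation) is an assumption made for illustration; the source only says that flags affect interpretation.

```python
class Glyph:
    """Hypothetical glyph: one code point's short and long Unicode forms."""

    def __init__(self, short, long=None, flags=()):
        self.short = short                            # exactly one position
        self.long = short if long is None else long   # "does the right thing"
        self.flags = frozenset(flags)


class SketchTypecase:
    """Hypothetical typecase for an 8-bit character set."""

    def __init__(self, name):
        self.name = name
        # Every code point starts out as invalid until it is defined.
        self.glyphs = [Glyph("\ufffd", flags={"INVALID"}) for _ in range(256)]

    def set(self, code, short, long=None, flags=()):
        self.glyphs[code] = Glyph(short, long, flags)

    def decode_short(self, octets):
        """One output position per input code, suitable for dumps."""
        return "".join(self.glyphs[c].short for c in octets)

    def decode_long(self, octets):
        """Readable text; flag handling here is an assumption."""
        out = []
        for c in octets:
            glyph = self.glyphs[c]
            if "EOF" in glyph.flags:
                break
            if "IGNORE" in glyph.flags:
                continue
            out.append(glyph.long)
        return "".join(out)


tc = SketchTypecase("toy-ascii")
for code in range(0x20, 0x7F):
    tc.set(code, chr(code))
tc.set(0x00, "\u2400", flags={"IGNORE"})   # NUL, shown as a control picture
tc.set(0x0A, "\u240a", long="\n")          # line feed
tc.set(0x1A, "\u241a", flags={"EOF"})      # CP/M-style end of file

data = b"HI\x00\n\x1aXX"
print(tc.decode_short(data))
print(repr(tc.decode_long(data)))   # 'HI\n'
```

The short form always has the same number of positions as the input, so columns in a dump stay aligned, while the long form is free to drop or expand code points.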
Containers¶
Some input artifacts will be raw bytes, for instance a hard disk image; others will be container formats such as .IMD, SIMH-TAP and ZIP.
If we wanted to, we could ingest these containers, examine them with OctetView and create the artifacts they contain for further examination.
But our audience is not here to see how IMD files are constructed, so it makes more sense not to instantiate the container files themselves, but only what they contain.
This is what containers do. Some of them even do it with the OctetView; they just don't create an HTML interpretation.
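The idea of instantiating only what a container holds can be sketched for the ZIP case with the standard library. `zip_members` is a hypothetical helper written for this example, not the project's container class, and it deliberately ignores everything about the ZIP file other than its members.

```python
import io
import zipfile


def zip_members(container_octets):
    """Yield (name, octets) for each regular member of a ZIP container.

    The container itself never becomes an artifact; only its contents do.
    """
    with zipfile.ZipFile(io.BytesIO(container_octets)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            yield info.filename, archive.read(info)


# Build a small ZIP in memory so the sketch is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("boot.bin", b"\x01\x02\x03")
    archive.writestr("readme.txt", b"hello")

artifacts = dict(zip_members(buffer.getvalue()))
print(sorted(artifacts))   # ['boot.bin', 'readme.txt']
```

Each yielded pair would then be handed to the excavation as a top-level artifact, so the HTML output documents the contents rather than the wrapper.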
Collections¶
AA will be used on artifacts from various public collections: Datamuseum.DK’s own BitArchive, Al Kossow & CHM’s bitsavers.org and so on.
The collection classes make it easier to pull in artifacts from such well-known collections, and they cache the downloaded artifacts to reduce traffic.
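The caching behaviour can be illustrated with a sketch in which the download is replaced by an injected function, so nothing touches the network. `CachedCollection`, `get`, `fake_fetch` and the identifier `"30001234"` are all hypothetical and exist only for this example; the real collection classes will have their own interfaces and cache layout.

```python
import hashlib
import pathlib
import tempfile


class CachedCollection:
    """Hypothetical collection wrapper that caches fetched artifacts on disk."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = pathlib.Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch   # callable: identifier -> bytes

    def get(self, ident):
        # Hash the identifier so that any string maps to a safe file name.
        name = hashlib.sha256(ident.encode("utf-8")).hexdigest()
        path = self.cache_dir / name
        if path.exists():
            return path.read_bytes()   # cache hit: no traffic
        octets = self.fetch(ident)
        path.write_bytes(octets)
        return octets


calls = []


def fake_fetch(ident):
    """Stand-in for a real download; records how often it is called."""
    calls.append(ident)
    return b"octets for " + ident.encode("utf-8")


with tempfile.TemporaryDirectory() as tmp:
    collection = CachedCollection(tmp, fake_fetch)
    first = collection.get("30001234")
    second = collection.get("30001234")

print(len(calls))   # 1: the second request was served from the cache
```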