Data model
==========

It is important to understand the overall data model if you want to
work in, and to a lesser degree with, the Autoarchaeologist.

Unless otherwise noted, all these are Python classes, which can be
freely subclassed.

Excavation
----------

An excavation is the top-level data structure, which turns a number of
input files into a number of HTML files.

* Drives the process of examination
* Drives the production of HTML output files
* Provides default interpretation of unloved artifacts
* Can decorate the HTML output files
* Provides the default values and parameters
* Creates the artifacts
* Provides various services to everything else

Relationships:

* Holds the artifacts (top- and excavated-)
* Holds the examiners
* Holds the default typecase
* Offers each artifact to each examiner once.

Artifact
--------

An artifact is a collection of one or more octets.

* Has a single unique sequential ordering of its octets
* Has a unique identifier (SHA256 of the unique sequential ordering)
* May be a concatenation of records
* May be a top-level artifact (= The files provided as input)

Relationships:

* May have notes
* May have types
* May have namespaces
* May have interpretations
* May have child artifacts
* May have a typecase (otherwise it will inherit one)

Record
------

A record is a fragment of an artifact.

* Has a key which is either None or a tuple
* Contains at least one octet

Relationships:

* Keys do not control the order of records in the artifact
* By convention disk-like artifacts use (cyl, head, sect) keys.
* By convention tape-like artifacts use (file, block) keys.

Examiner
--------

An examiner tries to understand and interpret (some of) the artifacts.
It applies increasingly expensive checks to determine whether an
artifact is one it can analyse, and only then performs a full analysis.

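
The staged approach described above can be sketched as follows. This is
only an illustration under stated assumptions: ``NotMine``, ``Artifact``,
``GzipExaminer`` and ``offer`` are hypothetical names invented for this
example, not the Autoarchaeologist API, and gzip is used merely because
the Python standard library can validate it.

.. code-block:: python

    import gzip
    import zlib


    class NotMine(Exception):
        """Hypothetical signal: this examiner declines the artifact."""


    class Artifact:
        """Hypothetical stand-in: some octets plus attached notes."""

        def __init__(self, octets):
            self.octets = bytes(octets)
            self.notes = set()


    class GzipExaminer:
        """Hypothetical examiner; the cheapest checks come first."""

        def __init__(self, artifact):
            octets = artifact.octets
            # Check 1 (cheapest): a gzip member has a 10 octet header
            # and an 8 octet trailer, so anything shorter cannot match.
            if len(octets) < 18:
                raise NotMine("too short")
            # Check 2 (cheap): magic number and "deflate" method octet.
            if octets[:3] != b"\x1f\x8b\x08":
                raise NotMine("bad magic")
            # Check 3 (expensive) doubles as the full analysis:
            # decompressing validates the stream and its CRC.
            try:
                self.payload = gzip.decompress(octets)
            except (OSError, EOFError, zlib.error) as err:
                raise NotMine(str(err)) from err
            artifact.notes.add("gzip")


    def offer(artifact, examiner_classes):
        """Offer the artifact once to each examiner class, collect takers."""
        takers = []
        for cls in examiner_classes:
            try:
                cls(artifact)
            except NotMine:
                continue
            takers.append(cls.__name__)
        return takers

Most artifacts are rejected by the first or second check, so the
expensive step only runs on plausible candidates.
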
Relationships:

* Is offered each artifact exactly once
* Can attach notes to artifacts
* Can attach types to artifacts
* Can attach namespaces to artifacts
* Can attach HTML or Unicode interpretations to artifacts
* Can, but normally should not, call other Examiners directly

Typecase
--------

Translates a character set into Unicode.

* Can translate a sequence of integers to short or long UTF-8 representation

TODO: Modal character sets like BAUDOT (telex, teleprinter)

Relationships:

* Is basically an array of Glyphs

Glyphs
------

Holds the information about one code point in a character set.

* Has a "short", single-position Unicode representation suitable for (hex)dumps
* Has a "long" Unicode representation which "does the right thing" (for instance newlines)
* Can have flags (INVALID, IGNORE, EOF and others) which affect interpretation

Containers
----------

Some input artifacts will be raw bytes, for instance a hard disk image;
others will be container formats such as ``.IMD``, ``SIMH-TAP`` and ``ZIP``.

If we wanted to, we could ingest these containers, examine them with
``OctetView`` and create the artifacts they contain for further examination.

But our audience is not here to see how ``IMD`` files are constructed,
so it makes more sense not to instantiate the container files directly,
but only what they contain.

This is what ``containers`` do, and some of them even do it with
``OctetView``; they just don't create an HTML interpretation.

Collections
-----------

AA will be used on artifacts from various public collections:
Datamuseum.DK's own BitArchive, Al Kossow & CHM's bitsavers.org and so on.

The collection classes make it easier to pull in artifacts from such
well-known collections, and cache the downloaded artifacts to reduce traffic.
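
Returning to the Artifact and Record descriptions earlier in this
document, the relationship between records, octet order and the SHA256
identifier can be illustrated with a minimal sketch. The ``Record`` and
``Artifact`` classes below are hypothetical stand-ins written for this
example; they mirror the description only and are not the
Autoarchaeologist classes.

.. code-block:: python

    import hashlib


    class Record:
        """Hypothetical record: a keyed fragment of an artifact."""

        def __init__(self, octets, key=None):
            # The key is either None or a tuple, e.g. (cyl, head, sect).
            if key is not None and not isinstance(key, tuple):
                raise TypeError("key must be None or a tuple")
            # A record contains at least one octet.
            if len(octets) < 1:
                raise ValueError("a record contains at least one octet")
            self.key = key
            self.octets = bytes(octets)


    class Artifact:
        """Hypothetical artifact: a concatenation of records."""

        def __init__(self, records):
            # Records keep the order they were given in;
            # they are deliberately not sorted by key.
            self.records = list(records)
            if not self.records:
                raise ValueError("an artifact contains at least one octet")

        def octets(self):
            """The single sequential ordering of the octets."""
            return b"".join(rec.octets for rec in self.records)

        def digest(self):
            """Identifier: SHA256 over the sequential ordering."""
            return hashlib.sha256(self.octets()).hexdigest()


    # Two disk-like records stored in interleaved sector order.
    disk = Artifact([
        Record(b"AB", key=(0, 0, 2)),
        Record(b"CD", key=(0, 0, 1)),
    ])
    # The same octets as one unkeyed record yield the same identifier.
    flat = Artifact([Record(b"ABCD")])

In this sketch the identifier depends only on the octets and their
order, so the keys can describe the original medium without affecting
identity.
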