Artifacts in some depth
=======================

The ``Artifact`` class is an attempt to generalize "some data", and
given that we are dealing with weird old computer artifacts, that
goes about as well as you would probably expect.

All artifacts are an array of 8-bit bytes, indexed from zero.

The bytes might be fragmented, and each fragment (Python class:
``Record``) may have a ``key``.

Fragment Keys
-------------

The fragments in the artifact will often be, but are not guaranteed
to be, sorted by key.

By convention, disk-like artifacts should have a ``(cylinder, head,
sector)`` tuple as key.

By convention, tape-like artifacts should have a ``(tape-file, record-number)``
tuple as key.
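
As a minimal sketch of these conventions, and explicitly not the
project's actual ``Record`` class, the following uses a hypothetical
``SketchRecord`` stand-in to show that tuple keys sort naturally in
Python, so fragments stored out of order can be put back in key order:

.. code-block:: python

    from dataclasses import dataclass
    from typing import Optional, Tuple


    @dataclass
    class SketchRecord:
        """Hypothetical stand-in for a fragment; not the real Record class."""
        octets: bytes
        key: Optional[Tuple[int, ...]] = None


    def sorted_by_key(records):
        """Return the keyed records in key order (tuples compare element-wise)."""
        keyed = (r for r in records if r.key is not None)
        return sorted(keyed, key=lambda r: r.key)


    # Disk-like artifact: (cylinder, head, sector) keys, stored out of order.
    disk = [
        SketchRecord(b"C", key=(0, 1, 0)),
        SketchRecord(b"A", key=(0, 0, 0)),
        SketchRecord(b"B", key=(0, 0, 1)),
    ]
    disk_image = b"".join(r.octets for r in sorted_by_key(disk))

    # Tape-like artifact: (tape-file, record-number) keys.
    tape = [
        SketchRecord(b"2", key=(1, 0)),
        SketchRecord(b"1", key=(0, 1)),
        SketchRecord(b"0", key=(0, 0)),
    ]
    tape_image = b"".join(r.octets for r in sorted_by_key(tape))

Fragments without a key are simply skipped in this sketch; how they
should be treated is a separate decision.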

Textfiles
---------

The only truly universal "kind" of file is the "textfile", and other
"text-based meta-formats" are built on top of that abstraction, for
instance PostScript, Intel HEX files or Comma-Separated-Values.

But how text-files are stored and structured depends on the system
software of the source computer:  IBM mainframes and midranges tend
to use either fixed-length records or Pascal-like strings, UNIX
separates lines with NL, MS-DOS uses CRNL, Rational R1000 does not
even byte-align the characters, and so on.  And of course the
character sets are all over the place and sometimes include a parity
bit.

In order to not spill all these complications into examiners for
"text-based meta-formats", artifacts have a ``.textfile`` attribute
which defaults to ``None`` until an examiner puts a UTF8 string
there, containing the text of the artifact, as it would appear on
the original system.
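
To illustrate the idea, here is a rough sketch of what system-specific
examiners might do before setting ``.textfile``.  Everything named
here is hypothetical: ``SketchArtifact`` only models the attribute,
the helper functions are made up for this example, and the 80-column
record length, the ``cp037`` EBCDIC code page and the stripping of
trailing padding are assumptions, not properties of any particular
system.

.. code-block:: python

    class SketchArtifact:
        """Hypothetical stand-in; only models the ``.textfile`` attribute."""

        def __init__(self, octets):
            self.octets = octets
            self.textfile = None    # stays None until an examiner fills it in


    def text_from_msdos(octets):
        """CRNL-separated ASCII lines become NL-separated text."""
        return octets.decode("ascii").replace("\r\n", "\n")


    def text_from_fixed_records(octets, reclen=80, encoding="cp037"):
        """Fixed-length records: one line per record, padding removed."""
        lines = []
        for offset in range(0, len(octets), reclen):
            record = octets[offset:offset + reclen].decode(encoding)
            lines.append(record.rstrip(" "))
        return "\n".join(lines) + "\n"


    dos = SketchArtifact(b"A\r\nB\r\n")
    dos.textfile = text_from_msdos(dos.octets)

    cards = SketchArtifact(
        ("HELLO".ljust(80) + "WORLD".ljust(80)).encode("cp037")
    )
    cards.textfile = text_from_fixed_records(cards.octets)

Once ``.textfile`` is set, an examiner for a "text-based meta-format"
only needs to look at the string and never at the storage details.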

Living with non-octets
----------------------

Today everything is organized on octets, to the point where "byte"
is always assumed to have 8 bits, but computers prior to that
convergence used all sorts of different word- and byte-sizes.

If the unit of data is smaller than 8 bits it is trivial to
ignore the unused bits.
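
A minimal sketch of this, assuming 7-bit characters carried in octets
whose top bit is parity or otherwise unused:  ``bytes.translate``
with a 256-entry table masks every octet in one call, without a
Python-level loop over the data.  The names ``STRIP_TOP_BIT`` and
``strip_top_bit`` are made up for this example.

.. code-block:: python

    # Translation table mapping every octet value to itself with bit 7 cleared.
    STRIP_TOP_BIT = bytes(i & 0x7F for i in range(256))


    def strip_top_bit(octets):
        """Return a copy of ``octets`` with the top bit of every octet cleared."""
        return octets.translate(STRIP_TOP_BIT)


    # 0xC1 is "A" (0x41) with the top bit set.
    cleaned = strip_top_bit(bytes([0xC1, 0x42, 0x0A]))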

But if the unit of data is wider than 8 bits, there is no good,
general and efficient way to handle that.

To get through GigaByte sized artifacts in reasonable time, we
have to build on the optimized fundamental Python data-types.

Fortunately most surviving data-media, notably paper- and magnetic
tapes, were limited to symbols of 8 bits or less.  Even 12-row punched
cards usually hold only 8 bits of data per column, so the original
systems already serialized almost all I/O data to octets, and
most artifacts we encounter are I/O data.

But internal data-media were wider:  memory, drums and disks
were usually formatted in terms of native machine-words, which may or may
not map nicely to 8-bit bytes.
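
As an example of a word size which does not map nicely, here is a
sketch which unpacks 12-bit words under one assumed layout:  two
big-endian words packed into every three octets.  Both the layout and
the function name are assumptions made for illustration, other
systems and dump tools packed their words differently.

.. code-block:: python

    def unpack_12bit_words(octets):
        """Unpack 12-bit words stored two per three octets (assumed layout)."""
        if len(octets) % 3:
            raise ValueError("length must be a multiple of 3 octets")
        words = []
        for offset in range(0, len(octets), 3):
            # Three octets hold 24 bits: high word first, then low word.
            triple = int.from_bytes(octets[offset:offset + 3], "big")
            words.append(triple >> 12)
            words.append(triple & 0xFFF)
        return words


    words = unpack_12bit_words(bytes([0xAB, 0xC1, 0x23]))

Note that this needs a Python-level loop per group of words, which is
the kind of overhead that becomes painful on GigaByte sized artifacts.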

For memory, be it ROM-type or core-dumps, :ref:`Pyreveng3` is probably
a more appropriate tool, and it has a very general memory class,
because "a program" is seldom measured in GigaBytes.

But once we have enough artifacts organized in units wider than an
octet, we may revisit this shortcoming.