Artifacts in some depth
=======================

The ``Artifact`` class is an attempt to generalize "some data", and
given that we are dealing with weird old computer artifacts, that
goes about as well as you would probably expect.

All artifacts are an array of 8-bit bytes, indexed from zero.

The bytes might be fragmented, and each fragment (python class: ``Record``)
may have a ``key``.

Fragment Keys
-------------

The fragments in the artifact will often be, but are not guaranteed to be,
sorted by key.

By convention, disk-like artifacts should have a ``(cylinder, head, sector)``
tuple as key.

By convention, tape-like artifacts should have a ``(tape-file, record-number)``
tuple as key.

Textfiles
---------

The only truly universal "kind" of file is the "textfile", and
other "text-based meta-formats" are built on top of that abstraction,
for instance PostScript, Intel HEX files or Comma-Separated-Values.

But how text-files are stored and structured depends on the system
software of the source computer: IBM mainframes and midrange systems
tend to use either fixed-length records or Pascal-like strings, UNIX
separates lines with NL, MS-DOS uses CRNL, Rational R1000 does not even
byte-align the characters, and so on.

And of course the character sets are all over the place and sometimes
include a parity bit.

In order to not spill all these complications into examiners for
"text-based meta-formats", artifacts have a ``.textfile`` attribute
which defaults to ``None`` until an examiner puts a UTF-8 string there,
containing the text of the artifact, as it would appear on the
original system.

Living with non-octets
----------------------

Today everything is organized on octets, to the point where "byte"
is always assumed to have 8 bits, but computers prior to that
convergence used all sorts of different word- and byte-sizes.

If the unit of data is smaller than 8 bits it is trivial to ignore
the unused bits.
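
As a small illustration of ignoring unused bits, the sketch below clears
the top bit of every octet, which is one way 7-bit characters stored with
a parity bit could be reduced to plain 7-bit values.  The helper name
``strip_top_bit`` is hypothetical and not part of the ``Artifact`` class;
it only relies on the standard ``bytes.translate`` method.

.. code-block:: python

    # Hypothetical helper for illustration, not part of the Artifact API.
    # Translation table mapping every octet to itself with bit 7 cleared.
    STRIP_TOP_BIT = bytes(b & 0x7F for b in range(256))

    def strip_top_bit(octets: bytes) -> bytes:
        """Return a copy of ``octets`` with bit 7 of every byte cleared."""
        return octets.translate(STRIP_TOP_BIT)

    # "Hello" as 7-bit characters with an odd-parity bit in bit 7.
    raw = bytes([0xC8, 0xE5, 0xEC, 0xEC, 0xEF])
    print(strip_top_bit(raw))   # b'Hello'

Because the work is done by a single table lookup over the whole buffer,
this stays fast even on large inputs.
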
But if the unit of data is wider than 8 bits, there is no good,
general and efficient way to handle that.  To get through
gigabyte-sized artifacts in reasonable time, we have to build on the
optimized fundamental python data-types.

Fortunately most surviving data-media, notably paper and magnetic
tapes, were limited to symbols of 8 bits or less; even 12-row punched
cards usually hold only 8 bits of data per column.  The original
systems therefore already serialized almost all I/O data to octets,
and most artifacts we encounter are I/O data.

But internal data-media were wider: memory, drums and disks were usually
formatted in terms of native machine-words, which may or may not map
nicely to 8-bit bytes.

For memory, be it ROM-type or core-dumps, :ref:`Pyreveng3` is probably
a more appropriate tool, and it has a very general memory class, because
"a program" is seldom measured in gigabytes.

But once we have enough wider-than-octet artifacts, we may revisit
this shortcoming.
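
To make the "may or may not map nicely" point concrete, here is a minimal
sketch of unpacking 12-bit machine words from an octet stream.  The packing
convention (two big-endian words per three octets) and the function name
``unpack_12bit_words`` are assumptions chosen for the example; real media
used many different layouts, and nothing here is part of the ``Artifact``
class.

.. code-block:: python

    # Hypothetical helper for illustration only.
    # Assumed layout: two big-endian 12-bit words packed into three octets.
    def unpack_12bit_words(octets: bytes) -> list[int]:
        """Return the 12-bit words stored two per three octets."""
        if len(octets) % 3:
            raise ValueError("length must be a multiple of 3 octets")
        words = []
        for i in range(0, len(octets), 3):
            b0, b1, b2 = octets[i:i + 3]
            # First word: all of b0 plus the high nibble of b1.
            words.append((b0 << 4) | (b1 >> 4))
            # Second word: low nibble of b1 plus all of b2.
            words.append(((b1 & 0x0F) << 8) | b2)
        return words

    print([hex(w) for w in unpack_12bit_words(bytes([0x12, 0x34, 0x56]))])
    # ['0x123', '0x456']

A per-word loop like this is easy to write for one layout, but it is
neither general nor fast, which is the trade-off described above.
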