OctetView - A tutorial
======================

OctetView is by far the easiest way to take artifacts apart.

The basic idea is that the artifact is cut into non-overlapping
objects, each of which covers one or more octets in the artifact.

…and then the OctetView class more or less takes care of the rest.

The objects can be examined before they are either discarded or inserted
into the interpretation of the artifact.

Let's take an example:

.. code-block:: none

    from autoarchaeologist.base import octetview as ov
    
    class CBM900LOut(ov.OctetView):
    
        ''' CBM900 L.out binary format '''
    
        def __init__(self, this):
            super().__init__(this)
            header = LdHeader(self, 0)
            if header.l_magic.val != 0o407 or header.l_flag.val != 0x10:
                return
            header.insert()
            self.add_interpretation()

We are writing an examiner for the CBM900 "l.out" object file format,
but first we have to find out if the artifact is one.

After we have initialized the OctetView parent class, we create an
object starting at the first octet in the artifact.

The l.out files start out with this structure:

.. code-block:: none

    struct ldheader {
	int	l_magic;	/* Magic number */
	int	l_flag;		/* Flags */
	int	l_machine;	/* Type of target machine */
	vaddr_t	l_entry;	/* Entrypoint */
	size_t	l_ssize[NLSEG];	/* Segment sizes */
    };

But our view is the actual storage layout of the structure, on this
particular hardware, using that specific C-compiler, so we define
our ``LdHeader`` class like this:

.. code-block:: none

    class LdHeader(ov.Struct):
    
        def __init__(self, tree, lo):
            super().__init__(
                tree,
                lo,
                l_magic_=ov.Le16,
                l_flag_=ov.Le16,
                l_machine_=ov.Le16,
                l_entry_=ov.Le32,
                l_ssize_=ov.Array(9, ov.Le32, vertical=True),
                pad__=2,
                vertical=True,
            )

``tree`` is the OctetView we are working in, aka ``self`` in
the ``CBM900LOut`` class.

``lo`` is the address where this data structure lives.

The names of the next five arguments end in a single underscore, so
they each define a field in the structure, by specifying which
class to instantiate for that field.

If we run the snippet above we get an interpretation which looks
like this:

.. code-block:: none

    0x000…030 LdHeader {
    0x000…030   l_magic = 0x0107	// @0x0
    0x000…030   l_flag = 0x0010	// @0x2
    0x000…030   l_machine = 0x0004	// @0x4
    0x000…030   l_entry = 0x00000030	// @0x6
    0x000…030   l_ssize = [	// @0xa
    0x000…030       [0x0]: 0x000000be
    0x000…030       [0x1]: 0x00000000
    0x000…030       [0x2]: 0x00000000
    0x000…030       [0x3]: 0x00000000
    0x000…030       [0x4]: 0x00000000
    0x000…030       [0x5]: 0x00000000
    0x000…030       [0x6]: 0x00000000
    0x000…030       [0x7]: 0x0000009a
    0x000…030       [0x8]: 0x0000004e
    0x000…030   ]
    0x000…030 }
    0x030…0ee   ab f1 2f […] 00 a9 fb   ┆  /[…]   ┆
    […]

The ``pad__=2`` field is missing because field arguments
which end in two underscores are not rendered.
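
As a cross-check that does not involve OctetView at all, the same storage
layout can be written as a format string for Python's standard ``struct``
module. This is only an illustrative sketch: the header octets are
synthesized from the values shown in the rendering above, and the names
``LDHEADER`` and ``fake_header`` are made up for this example.

.. code-block:: python

    import struct

    # "<" means little endian with no implicit alignment:
    # three 16 bit fields, one 32 bit field, nine 32 bit fields, two pad octets
    LDHEADER = struct.Struct("<HHHI9I2x")

    # Synthetic header built from the values in the rendering above
    ssize = [0xbe, 0, 0, 0, 0, 0, 0, 0x9a, 0x4e]
    fake_header = LDHEADER.pack(0o407, 0x10, 0x4, 0x30, *ssize)

    l_magic, l_flag, l_machine, l_entry, *l_ssize = LDHEADER.unpack(fake_header)

    print(hex(LDHEADER.size))           # total size, padding included
    print(hex(l_magic), hex(l_flag))
    print([hex(x) for x in l_ssize])

The size comes out as ``0x30``, which agrees with the ``0x000…030`` range
of the rendered ``LdHeader``, pad octets included.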

The rest of the artifact is default-hexdumped, because we have not
created any objects which cover that part of it.
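
One way to picture this is as a search for the gaps between the inserted
objects. The following is a toy model of that idea, not OctetView's actual
code; the function name ``uncovered``, the ``(lo, hi)`` tuples and the
artifact length of ``0x200`` octets are assumptions made for the illustration.

.. code-block:: python

    def uncovered(length, covered):
        ''' Yield the (lo, hi) ranges not covered by any object '''
        pos = 0
        for lo, hi in sorted(covered):
            if lo > pos:
                yield (pos, lo)
            pos = max(pos, hi)
        if pos < length:
            yield (pos, length)

    # Only a 0x30 octet header has been inserted in a 0x200 octet artifact
    gaps = list(uncovered(0x200, [(0x00, 0x30)]))
    print([(hex(lo), hex(hi)) for lo, hi in gaps])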

If we had not specified ``vertical=True`` to ``ov.Array``
the members of the array would all be on a single line,
and likewise, without ``vertical=True`` the entire ``LdHeader``
would be rendered on a single line.

Having structures and arrays horizontal while a data format is
reverse engineered makes it possible to ``grep -r`` all instances
of a struct in the entire excavation, to try to glean what this or
that field can contain and might mean.

Naked Structs
-------------

In normal structs the field attributes (i.e. ``foo.field``)
are the field objects.

In practice most fields are plain numbers, and it is a bit of bother
to write ``foo.field.val`` to get their numerical value.

In "Naked structs", made so with the optional argument ``naked=True``,
the field attribute will be ``field.val`` if the added field has
that attribute, so that the numeric value is available with ``foo.field``.

Note that this snapshots ``struct.field.val`` so later modifications to it will
not be reflected in ``struct.field``.
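
The snapshot behaviour can be illustrated with a toy model. The classes
``Field`` and ``ToyStruct`` below are stand-ins invented for this sketch and
do not reflect how ``ov.Struct`` is actually implemented.

.. code-block:: python

    class Field:
        ''' Stand-in for a numeric field object '''
        def __init__(self, val):
            self.val = val

    class ToyStruct:
        ''' Toy model of the attribute handling, not the real ov.Struct '''
        def __init__(self, naked=False, **fields):
            self.fields = fields
            for name, fld in fields.items():
                if naked and hasattr(fld, "val"):
                    # Snapshot the value at construction time
                    setattr(self, name, fld.val)
                else:
                    setattr(self, name, fld)

    normal = ToyStruct(count=Field(7))
    naked = ToyStruct(naked=True, count=Field(7))

    print(normal.count.val)     # field object, value reached through .val
    print(naked.count)          # plain number

    # A later change to the field object is not seen through the snapshot
    naked.fields["count"].val = 8
    print(naked.count, naked.fields["count"].val)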

Variable Structs
----------------

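Variable structures, where later fields depend on the values of earlier
ones, are created by passing ``more=True``, adding the remaining fields
with ``add_field()`` and finishing with ``done()``: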

.. code-block:: none

    class Something(ov.Struct):
    
        def __init__(self, tree, lo):
            super().__init__(
                tree,
                lo,
                width_=ov.Be24,
                name_=ov.Text(5),
                more=True,
            )
            if self.width.val < (1<<8):
                self.add_field("payload", ov.Octet)
            elif self.width.val < (1<<16):
                self.add_field("payload", ov.Be16)
            elif self.width.val < (1<<24):
                self.add_field("payload", ov.Be24)
            else:
                print("Something's wrong", self)
                exit(2)
            self.done()
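
To make the resulting layout concrete, here is a hand-written decoder for
the same record which uses nothing but ``int.from_bytes``. It is a sketch
under assumptions: that ``ov.Text(5)`` occupies five octets of ASCII, and
that the test record is representative. The name ``parse_something`` is
invented for this example.

.. code-block:: python

    def parse_something(octets):
        ''' Decode the variable-length record by hand '''
        width = int.from_bytes(octets[0:3], "big")      # like ov.Be24
        name = octets[3:8].decode("ascii")              # assumed: 5 ASCII octets
        # The size of the payload depends on the value of width.
        # A 24 bit value is always below 1 << 24, so three cases suffice.
        if width < (1 << 8):
            size = 1
        elif width < (1 << 16):
            size = 2
        else:
            size = 3
        payload = int.from_bytes(octets[8:8 + size], "big")
        return width, name, payload, 8 + size

    record = bytes.fromhex("000123") + b"HELLO" + bytes.fromhex("beef")
    print(parse_something(record))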

Field classes
-------------

Field classes should be subclasses of ``ov.Octets``. ``ov.Struct``
is one too, so yes: structs can be nested.

OctetView comes with a lot of handy subclasses already,
and most of them do what you expect:

*   Octets - some number of octets
*   Hidden - rendered as "Hidden", no matter how small or big
*   Opaque - rendered as "class-name[0x%x]"
*   HexOctets - rendered as hex string without spaces
*   Dump - octets but rendered with hex+text
*   This - an artifact
*   Text - strings
*   Array - Arrays of some field class
*   Octet - a single octet value
*   Le16, Le24, Le32, Le64 - Little endian integers
*   Be16, Be24, Be32, Be64 - Big endian integers
*   L2301, L1032 - Confused endian double word integers

``ov.Array`` is a factory which returns a class. In the ``LdHeader``
example above it is used for an array of 9 little-endian 32 bit
numbers.
All the elements of an array have the same class, but they need not
have the same size.

``ov.Text`` is a factory which returns a class for a string of
a given length.
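
What the ``Le`` and ``Be`` prefixes in the list above mean for the decoded
value can be shown with plain Python, independent of OctetView. The sketch
assumes the integer classes follow the usual little and big endian
conventions; the confused-endian variants are left out.

.. code-block:: python

    raw = bytes([0x12, 0x34, 0x56])

    # Same three octets, two different interpretations
    le24 = int.from_bytes(raw, "little")
    be24 = int.from_bytes(raw, "big")

    print(hex(le24))    # 0x563412
    print(hex(be24))    # 0x123456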

Field classes must have a ``render()`` method which is responsible for
how they will appear in the interpretation, so for instance an RC4000
timestamp can be defined like this:

.. code-block:: none

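    import time

    class ShortClock(ov.Be24):

        def render(self):
            if self.val == 0:
                yield "                "
            else:
                # self.val counts units of (1 << 19) * 100 microseconds
                ut = (self.val << 19) * 100e-6
                # Seconds between the clock's epoch and the UNIX epoch
                t0 = (366+365)*24*60*60
                yield time.strftime(
                    "%Y-%m-%dT%H:%M",
                    time.gmtime(ut - t0)
                )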
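
The arithmetic in ``render()`` can also be tried outside OctetView. The
function below repeats the same formula on a plain integer; the name
``short_clock`` and the sample value are chosen for this sketch only. Read
literally, the ``t0`` offset of one leap year plus one normal year puts the
clock's epoch at 1968-01-01.

.. code-block:: python

    import time

    def short_clock(val):
        ''' Convert a 24 bit short clock value to a timestamp string '''
        # The value counts units of (1 << 19) * 100 microseconds
        ut = (val << 19) * 100e-6
        # Seconds from the clock's epoch to the UNIX epoch
        t0 = (366 + 365) * 24 * 60 * 60
        return time.strftime("%Y-%m-%dT%H:%M", time.gmtime(ut - t0))

    print(short_clock(0x200000))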

Syntactic Sugar
---------------

There are two levels of syntactic sugar available on top of ``ov.Struct``.

The first level of syntactic sugar is this:

.. code-block:: none

    class CDef():
        pointer = ov.Le32
        char = ov.Octet
        short = ov.Le16
        int = ov.Le32
        long = ov.Le64
        uid_t = ov.Le16
        gid_t = ov.Le16
        daddr_t = ov.Le32
    
    class Inode(ov.Struct):
        TYPES = CDef()
        FIELDS = [
            ( "di_mode", "short"),
            ( "di_nlink", "short"),
            ( "di_uid", "uid_t"),
            ( "di_gid", "gid_t"),
            […]
            ( "di_dbx", "daddr_t", 12),
            […]
        ]

As the example indicates, this allows common UNIX structures
to be "fleshed out" with platform specific variable types.

The type classes should be able to impose any alignment or
padding they require, but this has not been tested in practice
yet.

The advantage of using this form is that subclasses can easily
edit the field list, for instance to insert or delete fields.
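
Such an edit could look like the following sketch. Plain classes stand in
for the ``ov.Struct`` subclasses so the example runs on its own, and both
the ``WideIdInode`` variant and its ``di_flags`` field are hypothetical.

.. code-block:: python

    class BaseInode:
        FIELDS = [
            ("di_mode", "short"),
            ("di_nlink", "short"),
            ("di_uid", "uid_t"),
            ("di_gid", "gid_t"),
        ]

    class WideIdInode(BaseInode):
        # Copy the list so the parent class is left untouched
        FIELDS = list(BaseInode.FIELDS)
        # Hypothetical variant: change the type of di_gid ...
        FIELDS[3] = ("di_gid", "int")
        # ... and insert an extra field after di_nlink
        FIELDS.insert(2, ("di_flags", "short"))

    print([name for name, *_ in WideIdInode.FIELDS])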

The second level of syntactic sugar makes that harder, but
it is really convenient:

.. code-block:: none

    class Inode(ov.Struct):
        TYPES = CDef()
        FIELDS = ov.cstruct_to_fields('''
            short di_mode;
            short di_nlink;
            uid_t di_uid;
            gid_t di_gid;
            […]
            daddr_t di_dbx[12];
            […]
        ''')

(Pointer syntax and multidimensional arrays are not yet supported.)
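
To give a rough idea of what such a conversion involves, here is a toy
converter from C-like declarations to ``FIELDS``-style tuples. It is not
the implementation of ``ov.cstruct_to_fields``; the function name
``toy_cstruct_to_fields`` and its regular expression are assumptions made
for this sketch. Like the real thing, it rejects pointers and
multidimensional arrays.

.. code-block:: python

    import re

    def toy_cstruct_to_fields(text):
        ''' Toy converter from C-like declarations to FIELDS tuples '''
        fields = []
        for decl in text.split(";"):
            decl = decl.strip()
            if not decl:
                continue
            # "type name" with an optional single "[count]"
            m = re.fullmatch(r"(\w+)\s+(\w+)(?:\[(\d+)\])?", decl)
            if m is None:
                raise ValueError("unsupported declaration: " + repr(decl))
            ctype, name, count = m.groups()
            if count is None:
                fields.append((name, ctype))
            else:
                fields.append((name, ctype, int(count)))
        return fields

    fields = toy_cstruct_to_fields('''
        short di_mode;
        uid_t di_uid;
        daddr_t di_dbx[12];
    ''')
    print(fields)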

When octets are too big
-----------------------

If octets are too big for the job, ``OctetView`` has a sibling called
``BitView``, which can do the exact same things, but with 8 times
higher resolution, and it is much more than 8 times slower.
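
The kind of data which calls for bit resolution is fields which do not
start or end on octet boundaries. The sketch below does not use
``BitView`` and says nothing about its API; it only shows, with a helper
named ``bits`` invented for this example and an assumed MSB-first bit
numbering, what addressing an artifact by bit looks like.

.. code-block:: python

    def bits(octets, lo, width):
        ''' Return width bits starting at bit address lo (MSB first) '''
        whole = int.from_bytes(octets, "big")
        shift = len(octets) * 8 - lo - width
        return (whole >> shift) & ((1 << width) - 1)

    raw = bytes([0xab, 0xcd, 0xef])

    # Two 12 bit fields packed into three octets
    first = bits(raw, 0, 12)
    second = bits(raw, 12, 12)
    print(hex(first), hex(second))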