OctetView - A tutorial

OctetView is by far the easiest way to take artifacts apart.

The basic idea is that the artifact is cut into non-overlapping objects, each of which covers one or more octets in the artifact.

…and then the OctetView class more or less takes care of the rest.

The objects can be examined before they are either discarded or inserted into the interpretation of the artifact.

Let's take an example:

from autoarchaeologist.base import octetview as ov

class CBM900LOut(ov.OctetView):

    ''' CBM900 L.out binary format '''

    def __init__(self, this):
        super().__init__(this)
        header = LdHeader(self, 0)
        if header.l_magic.val != 0o407 or header.l_flag.val != 0x10:
            return
        header.insert()
        self.add_interpretation()

We are writing an examiner for the CBM900 “l.out” object file format, but first we have to find out if the artifact is one.

After we have initialized the OctetView parent class, we create an object starting at the first octet in the artifact.

The l.out files start out with this structure:

struct ldheader {
    int     l_magic;        /* Magic number */
    int     l_flag;         /* Flags */
    int     l_machine;      /* Type of target machine */
    vaddr_t l_entry;        /* Entrypoint */
    size_t  l_ssize[NLSEG]; /* Segment sizes */
};

But our view is the actual storage layout of the structure, on this particular hardware, using that specific C-compiler, so we define our LdHeader class like this:

class LdHeader(ov.Struct):

    def __init__(self, tree, lo):
        super().__init__(
            tree,
            lo,
            l_magic_=ov.Le16,
            l_flag_=ov.Le16,
            l_machine_=ov.Le16,
            l_entry_=ov.Le32,
            l_ssize_=ov.Array(9, ov.Le32, vertical=True),
            pad__=2,
            vertical=True,
        )

tree is the OctetView we are working in, aka self in the CBM900LOut class.

lo is the address where this data structure lives.

The names of the next five arguments end in an underscore, so each of them defines a field in the structure by specifying which class to instantiate for that field.

If we run the snippet above we get an interpretation which looks like this:

0x000…030 LdHeader {
0x000…030   l_magic = 0x0107        // @0x0
0x000…030   l_flag = 0x0010 // @0x2
0x000…030   l_machine = 0x0004      // @0x4
0x000…030   l_entry = 0x00000030    // @0x6
0x000…030   l_ssize = [     // @0xa
0x000…030       [0x0]: 0x000000be
0x000…030       [0x1]: 0x00000000
0x000…030       [0x2]: 0x00000000
0x000…030       [0x3]: 0x00000000
0x000…030       [0x4]: 0x00000000
0x000…030       [0x5]: 0x00000000
0x000…030       [0x6]: 0x00000000
0x000…030       [0x7]: 0x0000009a
0x000…030       [0x8]: 0x0000004e
0x000…030   ]
0x000…030 }
0x030…0ee   ab f1 2f […] 00 a9 fb   ┆  /[…]   ┆
[…]

The pad__=2 field is missing because field arguments which end in two underscores are not rendered.
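The offsets in the rendering above can be cross-checked without OctetView. The following is only a sketch using Python's standard struct module, not anything OctetView does internally; LDHEADER and parse_ldheader are names made up for this illustration, and the sample octets are rebuilt from the values shown in the rendering.

```python
import struct

# "<" means little-endian with no implicit alignment: three 16-bit
# words, one 32-bit word, nine 32-bit segment sizes, two pad octets.
LDHEADER = struct.Struct("<HHHI9I2x")

def parse_ldheader(octets):
    ''' Unpack a header from the start of octets (illustration only) '''
    fields = LDHEADER.unpack_from(octets, 0)
    return {
        "l_magic": fields[0],
        "l_flag": fields[1],
        "l_machine": fields[2],
        "l_entry": fields[3],
        "l_ssize": list(fields[4:13]),
    }

# Sample header rebuilt from the values in the rendering above
sample = LDHEADER.pack(
    0o407, 0x10, 4, 0x30,
    0xbe, 0, 0, 0, 0, 0, 0, 0x9a, 0x4e,
)
hdr = parse_ldheader(sample)
```

The size of this layout is 0x30 octets, which matches the 0x000…030 range of the LdHeader object, and octal 0o407 is the same number as the 0x0107 shown for l_magic.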

The rest of the artifact is default-hexdumped, because we have not created any objects which cover that part of it.
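One way to picture this: the inserted objects are non-overlapping address ranges, and whatever lies between them is left for the default hexdump. The sketch below is a stand-alone illustration of that idea, not OctetView's code; the function name, the half-open (lo, hi) convention and the artifact length of 0x200 are assumptions made for the example.

```python
def uncovered_ranges(length, objects):
    ''' Return the (lo, hi) ranges not covered by any object.

    objects is an iterable of half-open (lo, hi) ranges which
    must not overlap; length is the size of the artifact.
    '''
    gaps = []
    pos = 0
    for lo, hi in sorted(objects):
        if lo < pos:
            raise ValueError("objects overlap")
        if lo > pos:
            gaps.append((pos, lo))
        pos = hi
    if pos < length:
        gaps.append((pos, length))
    return gaps

# A single header object at 0x000…030 in a hypothetical 0x200 octet artifact
gaps = uncovered_ranges(0x200, [(0x00, 0x30)])
```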

If we had not specified vertical=True to ov.Array the members of the array would all be on a single line, and likewise, without vertical=True the entire LdHeader would be rendered on a single line.

Having structures and arrays horizontal while a data format is reverse engineered makes it possible to grep -r all instances of a struct in the entire excavation, to try to glean what this or that field can contain and might mean.

Naked Structs

In normal structs the field attributes (i.e. foo.field) are the field objects.

In practice most fields are plain numbers, and it is a bit of bother to write foo.field.val to get their numerical value.

In “Naked structs”, made so with the optional argument naked=True, the field attribute will be field.val if the added field has that attribute, so that the numeric value is available with foo.field.

Note that this snapshots struct.field.val so later modifications to it will not be reflected in struct.field.
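The snapshot behaviour can be shown with a toy model. This is a sketch of the semantics described above, not the ov.Struct implementation; ToyField and ToyStruct are hypothetical stand-ins.

```python
class ToyField:
    ''' Stand-in for a numeric field class with a .val attribute '''
    def __init__(self, val):
        self.val = val

class ToyStruct:
    ''' Stand-in for a struct with an optional naked=True mode '''
    def __init__(self, naked=False, **fields):
        self.fields = fields
        for name, field in fields.items():
            if naked and hasattr(field, "val"):
                # Naked: copy the value once, when the field is added
                setattr(self, name, field.val)
            else:
                # Normal: the attribute is the field object itself
                setattr(self, name, field)

normal = ToyStruct(count=ToyField(7))
naked = ToyStruct(naked=True, count=ToyField(7))

# Modify the underlying field objects afterwards
normal.fields["count"].val = 8
naked.fields["count"].val = 8
```

After the modification normal.count.val is 8, because it is the same object, while naked.count is still 7, because it was copied when the field was added.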

Variable Structs

Variable structures are created like this:

class Something(ov.Struct):

    def __init__(self, tree, lo):
        super().__init__(
            tree,
            lo,
            width_=ov.Be24,
            name_=ov.Text(5),
            more=True,
        )
        if self.width.val < (1<<8):
            self.add_field("payload", ov.Octet)
        elif self.width.val < (1<<16):
            self.add_field("payload", ov.Be16)
        elif self.width.val < (1<<24):
            self.add_field("payload", ov.Be24)
        else:
            print("Something is wrong", self)
            exit(2)
        self.done()
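In the example above the class of the payload field depends on the value of width, which is only known once the fixed fields have been read. Stripped of OctetView, the decoding amounts to roughly the following sketch. It assumes that ov.Text(5) covers five octets of ASCII text, which this tutorial does not state, and decode_something is a name made up for the illustration.

```python
def decode_something(octets):
    ''' Return (width, name, payload, total_size) '''
    # Fixed part: a 24 bit big-endian number and a five octet name
    width = int.from_bytes(octets[0:3], "big")
    name = octets[3:8].decode("ascii")

    # Variable part: the payload size follows from the width value
    if width < (1 << 8):
        size = 1
    elif width < (1 << 16):
        size = 2
    else:
        size = 3
    payload = int.from_bytes(octets[8:8 + size], "big")
    return width, name, payload, 8 + size

small = decode_something(bytes([0, 0, 5]) + b"hello" + bytes([0x2a]))
large = decode_something(bytes([0, 1, 0]) + b"abcde" + bytes([0x12, 0x34]))
```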

Field classes

Field classes should be subclasses of ov.Octets, which ov.Struct also is, so yes: Structs can be nested.

OctetView comes with a lot of handy subclasses already, and most of them do what you expect:

  • Octets - some number of octets

  • Hidden - rendered as “Hidden”, no matter how small or big

  • Opaque - rendered as “class-name[0x%x]”

  • HexOctets - rendered as hex string without spaces

  • Dump - octets but rendered with hex+text

  • This - an artifact

  • Text - strings

  • Array - Arrays of some field class

  • Octet - a single octet value

  • Le16, Le24, Le32, Le64 - Little endian integers

  • Be16, Be24, Be32, Be64 - Big endian integers

  • L2301, L1032 - Confused endian double word integers
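For readers who want to see what the little- and big-endian integer classes in the list above mean in terms of octets, here is a sketch using only Python's built-in int.from_bytes. It does not use OctetView, and it leaves out L2301 and L1032 because their exact octet order is not described in this tutorial.

```python
# The first four octets of the l.out header from the earlier example
octets = bytes([0x07, 0x01, 0x10, 0x00])

# Little-endian: the least significant octet comes first
le16 = int.from_bytes(octets[0:2], "little")
le32 = int.from_bytes(octets[0:4], "little")

# Big-endian: the most significant octet comes first
be16 = int.from_bytes(octets[0:2], "big")
be24 = int.from_bytes(octets[0:3], "big")

print(hex(le16), hex(le32), hex(be16), hex(be24))
```

Read as a little-endian 16 bit number, the first two octets give 0x107, the l_magic value from the earlier rendering; read as big-endian, the same two octets give 0x701.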

ov.Array is a factory which returns a class; in the example above it is used for an array of 9 little-endian 32 bit numbers. All the elements of an array have the same class, but they need not have the same size.

ov.Text is a factory which returns a class for a string of a given length.

Field classes must have a render() method which is responsible for how they will appear in the interpretation, so for instance a RC4000 timestamp can be defined like this:

import time

class ShortClock(ov.Be24):

    def render(self):
        if self.val == 0:
            yield "                "
        else:
            ut = (self.val << 19) * 100e-6
            t0 = (366+365)*24*60*60
            yield time.strftime(
                "%Y-%m-%dT%H:%M",
                time.gmtime(ut - t0)
            )
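The arithmetic in that render() method can be tried out on its own. The sketch below repeats it as a plain function, with short_clock_to_text being a name made up for this illustration. As the code reads, each count is worth 2^19 units of 100 µs, about 52.4 seconds, and t0 is the 366 + 365 days between 1968-01-01 and the UNIX epoch.

```python
import time

def short_clock_to_text(word):
    ''' Convert a 24 bit clock value to text, as in the example above '''
    if word == 0:
        # Same width as the formatted timestamp
        return " " * 16
    # Scale up by 2**19, then convert units of 100 µs to seconds
    ut = (word << 19) * 100e-6
    # Seconds from 1968-01-01 to 1970-01-01 (1968 was a leap year)
    t0 = (366 + 365) * 24 * 60 * 60
    return time.strftime("%Y-%m-%dT%H:%M", time.gmtime(ut - t0))

# 1204651 is the first count which lands after the UNIX epoch
text = short_clock_to_text(1204651)
```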

Syntactic Sugar

There are two levels of syntactic sugar available on top of ov.Struct.

The first level of syntactic sugar is this:

class CDef():
    pointer = ov.Le32
    char = ov.Octet
    short = ov.Le16
    int = ov.Le32
    long = ov.Le64
    uid_t = ov.Le16
    gid_t = ov.Le16
    daddr_t = ov.Le32

class Inode(ov.Struct):
    TYPES = CDef()
    FIELDS = [
        ( "di_mode", "short"),
        ( "di_nlink", "short"),
        ( "di_uid", "uid_t"),
        ( "di_gid", "gid_t"),
        […]
        ( "di_dbx", "daddr_t", 12),
        […]
    ]

As the example indicates, this allows common UNIX structures to be “fleshed out” with platform specific variable types.

The type classes should be able to impose any alignment or padding they require, but this has not been tested in practice yet.

The advantage of using this form, is that subclasses can easily edit the field list, for instance to insert or delete fields.
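Editing the field list in a subclass could look something like the sketch below. It uses plain Python classes rather than ov.Struct so that it runs on its own; ToyInode, ToyInodeVariant and the di_flags field are hypothetical.

```python
class ToyInode:
    ''' Stand-in for a struct class with a FIELDS list '''
    FIELDS = [
        ("di_mode", "short"),
        ("di_nlink", "short"),
        ("di_uid", "uid_t"),
        ("di_gid", "gid_t"),
    ]

class ToyInodeVariant(ToyInode):
    ''' A variant with one field inserted and one deleted '''
    # Copy first, so the parent class' list is left untouched
    FIELDS = list(ToyInode.FIELDS)
    FIELDS.insert(2, ("di_flags", "short"))
    FIELDS.remove(("di_gid", "gid_t"))

names = [field[0] for field in ToyInodeVariant.FIELDS]
```

Copying the list before editing it matters, because otherwise the insert and the remove would also change the parent class.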

The second level of syntactic sugar makes that harder, but it is really convenient:

class Inode(ov.Struct):
    TYPES = CDef()
    FIELDS = ov.cstruct_to_fields('''
        short di_mode;
        short di_nlink;
        uid_t di_uid;
        gid_t di_gid;
        […]
        daddr_t di_dbx[12];
        […]
    ''')

(Pointer syntax and multidimensional arrays are not yet supported.)
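To give an impression of what such a conversion involves, here is a much simplified sketch which turns C-style declarations into the tuples used by the first level of sugar. It is not the implementation of ov.cstruct_to_fields; toy_cstruct_to_fields and its regular expression are assumptions made for this illustration.

```python
import re

# One declaration: a type name, a field name, an optional [count]
DECL = re.compile(r"^\s*(\w+)\s+(\w+)\s*(?:\[(\d+)\])?\s*$")

def toy_cstruct_to_fields(text):
    ''' Convert "type name;" declarations to FIELDS tuples '''
    fields = []
    for decl in text.split(";"):
        if not decl.strip():
            continue
        match = DECL.match(decl)
        if match is None:
            # Pointers, multidimensional arrays and anything else
            raise ValueError("unsupported declaration: " + decl.strip())
        ctype, name, count = match.groups()
        if count is None:
            fields.append((name, ctype))
        else:
            fields.append((name, ctype, int(count)))
    return fields

fields = toy_cstruct_to_fields('''
    short di_mode;
    uid_t di_uid;
    daddr_t di_dbx[12];
''')
```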

When octets are too big

If octets are too big for the job, OctetView has a sibling called BitView, which can do the exact same things, but with 8 times higher resolution, and much more than 8 times slower.
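What the higher resolution means can be sketched with a small stand-alone function which pulls a field out of an artifact by bit address instead of octet address. This does not use BitView and says nothing about its API; extract_bits is a made-up name, and numbering the bits from the most significant bit of the first octet is an assumption.

```python
def extract_bits(octets, lo, width):
    ''' Return width bits starting at bit address lo (MSB first) '''
    total = len(octets) * 8
    shift = total - lo - width
    if lo < 0 or width < 0 or shift < 0:
        raise ValueError("bit range outside the octets")
    # Treat the octets as one big number and cut the field out of it
    value = int.from_bytes(octets, "big")
    return (value >> shift) & ((1 << width) - 1)

# An 8 bit field which straddles the boundary between two octets
middle = extract_bits(b"\xab\xcd", 4, 8)
```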