After reading some of Google's Protobuf documentation, and working extensively with Apache Thrift, I decided I could learn most effectively by creating my own proof-of-concept binary serialization framework. I would work in a largely test-driven fashion, and attempt to encode every bug I encountered in a reproducible unit test.

My proof of concept was going to be written in Python, although ideally the code would be written clearly enough that later implementation in Node, Scala or any other language wouldn't be difficult.

I decided early on that I would severely limit the scope of the protocol. Although user-defined data structures and common idioms like maps, lists and optional values would be supported, no effort would be made (as it is in Protobuf) to achieve robustness in the face of outdated configuration files.

I also decided that I'd focus on the interesting part – the data serialization – and write an extremely simple parser for the definition files: not a production-quality system, but a basic line-by-line parser with little flexibility in what it accepts.

Limitations aside, I did want to make this a streaming library. Python's extensive support for generators, and hence lazy evaluation, means that sending and receiving streams of binary data (through e.g. a socket) should be particularly easy to implement.
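To illustrate that idea, here's a hypothetical helper (not part of the framework – the name iter_bytes is my own) that lazily yields individual bytes from any file-like object; an io.BytesIO stands in for a socket:

```python
import io


def iter_bytes(fileobj, chunk_size=1024):
    """Lazily yield individual bytes (as ints) from a file-like object."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            return
        # Iterating over a bytes object yields ints, one per byte.
        yield from chunk


# A BytesIO stands in for a socket here.
stream = iter_bytes(io.BytesIO(bytes([0b1000_0010, 0b1000_1110, 0b0000_0001])))
```

Because the deserializers only ever pull one byte at a time from such an iterator, decoding can begin before the whole message has arrived.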

I also wanted to use a strictly test-driven style of development. Although this isn't my usual way of working for things like web applications, it made sense to write good unit tests for this low-level code.

Finally, I didn't want to write anything that relied on generating code like Thrift or Protobuf do. I'd like to read in definition files and automatically create data types around those definitions.


The first thing to design was a varying-length integer type. As well as being useful in its own right, it would also form an important part of the type system for varying-length collections such as lists, maps and strings.

# tests.py
# coding=utf-8

from builtin_types import unsigned_int_to_bytes, read_unsigned_int

# This is a stupendously big number.
BIG_NUMBER = int("9" * 100000)


def test_numbers_roundtrip():
    """
    Serialize and deserialize unsigned integers of varying length.
    """
    for num in (0, 1, 2, 178, 300, BIG_NUMBER):
        num2 = read_unsigned_int(unsigned_int_to_bytes(num))
        assert num2 == num

It was important to have a working "roundtrip" test. This would make sure that unsigned_int_to_bytes and its inverse, read_unsigned_int, would always compose together to give the identity function.

I also wanted to write a test for each component, making sure that the binary format was what I was expecting. I've used the LEB128 format to encode unsigned integers of arbitrary size into a stream of bytes. In short, the most significant bit of each byte is a "continuation bit" describing whether there are any more bytes in the number to read. The other 7 bits are interpreted as usual for a binary number.
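As a worked example (using 300, a number of my own choosing rather than one from the tests): 300 is 0b10_0101100 in binary, so the low seven bits go into the first byte with the continuation bit set, and the remaining bits into a second byte with the continuation bit clear:

```python
num = 300  # 0b10_0101100

first = (num & 0b0111_1111) | 0b1000_0000  # low 7 bits, continuation bit set
second = num >> 7                          # remaining bits, continuation bit clear

encoded = bytes([first, second])  # two bytes: 0b1010_1100, 0b0000_0010
```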

def test_serialize_number():
    """Serialize a multi-byte number."""
    num = 18178
    assert [
        0b1000_0010,
        0b1000_1110,
        0b0000_0001
    ] == list(unsigned_int_to_bytes(num))


def test_deserialize_number():
    """Deserialize a multi-byte number."""
    bytestring = bytes([
        0b1010_0001,
        0b1100_1111,
        0b1000_0010,
        0b0100_0001
    ])
    assert 136357793 == read_unsigned_int(bytestring)
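To check that expected value by hand: strip the continuation bit from each byte, then shift each 7-bit group into place, least significant group first:

```python
# The four bytes from the test above, with continuation bits stripped.
value = (
    (0b010_0001 << 0)     # first byte:  0b1010_0001 & 0b0111_1111
    | (0b100_1111 << 7)   # second byte: 0b1100_1111 & 0b0111_1111
    | (0b000_0010 << 14)  # third byte:  0b1000_0010 & 0b0111_1111
    | (0b100_0001 << 21)  # fourth byte: 0b0100_0001 (no continuation bit)
)
```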

Next up is serializing text. Luckily, in Python, encoding a string of Unicode text as UTF-8 bytes is just a call to the .encode() method. Similarly, .decode() transforms a UTF-8 bytes object back into a native Python string.

def test_text_roundtrip():
    """
    Serialize and deserialize arbitrary unicode text.
    """
    for text in ("", "a", "Hello, world!", "9" * 1000):
        assert text == read_text(text_to_bytes(text))

In this case I trust the (de)serialization performed by .encode() and .decode(), so I won't write unit tests around this feature – it's enough to know that the two functions read_text and text_to_bytes are each other's inverse.
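For reference, here's a sketch of how text_to_bytes and read_text might look, with the string length-prefixed as a varying-length integer. The LEB128 helpers are repeated inline so the snippet stands alone; the real implementations may differ in detail:

```python
def unsigned_int_to_bytes(n):
    """LEB128-encode a non-negative integer, as described above."""
    while True:
        byte = n & 0b0111_1111
        n >>= 7
        if n:
            yield byte | 0b1000_0000
        else:
            yield byte
            return


def read_unsigned_int(bytestream):
    """Decode a LEB128-encoded integer from a stream of bytes."""
    number = offset = 0
    for byte in bytestream:
        number |= (byte & 0b0111_1111) << offset
        offset += 7
        if not byte & 0b1000_0000:
            break
    return number


def text_to_bytes(text):
    """Yield a length-prefixed, UTF-8 encoded string."""
    encoded = text.encode("utf-8")
    yield from unsigned_int_to_bytes(len(encoded))
    yield from encoded


def read_text(bytestream):
    """Read a length-prefixed, UTF-8 encoded string."""
    stream = iter(bytestream)
    length = read_unsigned_int(stream)
    return bytes(next(stream) for _ in range(length)).decode("utf-8")
```

Note the iter() call in read_text: both reads must consume the *same* stream, so a plain bytes object is first wrapped in an iterator.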

I'll also create a boolean type. In Python, booleans are already a subclass of integers, so I'll just encode the type as if it were a variable-length integer. This might seem wasteful, but it's generally more useful to align data to the nearest byte – especially for network transport.

def test_boolean_roundtrip():
    """
    Serialize and deserialize booleans.
    """
    for b in (True, False):
        assert b == read_boolean(boolean_to_bytes(b))

Again, we've already tested the varying-length integer serialization, so we don't need to write individual test cases for that. Here's the very simple implementation:

def read_boolean(bytestream):
    """
    Read a boolean value from a bytestream.
    """
    return bool(read_unsigned_int(bytestream))


def boolean_to_bytes(boolean):
    """
    Serialize a boolean value to bytes.
    """
    yield from unsigned_int_to_bytes(int(boolean))

At this point, I refactored the functions which translated between Python native data types and bytestreams, turning them into static methods on type classes. This way I could define a more uniform interface for types (int, string etc.) which could also be honoured by any user-generated types.

class BuiltinType:
    @staticmethod
    def read(bytestream):
        """Read a value of this type from a bytestream."""
        raise NotImplementedError("read(bytestream)")

    @staticmethod
    def to_bytes(value):
        """Write a value of type type to a bytestream."""
        raise NotImplementedError("to_bytes(value)")


class Boolean(BuiltinType):
    @staticmethod
    def read(bytestream):
        return bool(UnsignedInt.read(bytestream))

    @staticmethod
    def to_bytes(boolean):
        yield from UnsignedInt.to_bytes(int(boolean))


class String(BuiltinType):
    @staticmethod
    def to_bytes(text):
        encoded = text.encode("utf-8")
        yield from UnsignedInt.to_bytes(len(encoded))
        yield from encoded

    @staticmethod
    def read_n_bytes(n, bytestream):
        while n:
            yield next(bytestream)
            n -= 1

    @classmethod
    def read(cls, bytestream):
        length = UnsignedInt.read(bytestream)
        return bytes(cls.read_n_bytes(length, bytestream)).decode("utf-8")


class UnsignedInt(BuiltinType):
    @staticmethod
    def to_bytes(n):
        # Check the number is "unsigned" (i.e. non-negative).
        if n < 0:
            raise ValueError("Cannot encode a negative number as an unsigned int")

        more_to_come = True

        while more_to_come:
            # Take the number's lowest 7 bits.
            next_byte = n & 0b0111_1111

            # Chop those 7 bits off the end.
            n >>= 7

            # If there's more data to come, set the most significant bit.
            if n:
                next_byte |= 0b1000_0000

            yield next_byte

            more_to_come = n > 0

    @staticmethod
    def read(bytestream):
        number = 0
        offset = 0

        for byte in bytestream:
            # Check for the continuation bit.
            should_read_more = byte & 0b1000_0000

            # Add this byte's value to our running total.
            number |= (byte & 0b0111_1111) << offset
            offset += 7

            # Stop reading if the continuation bit isn't present.
            if not should_read_more:
                break

        return number

To create structures involving these types, I'll define a simple Map type. This type is an unordered associative list of key-value pairs, where the key is a varying-length integer and the value is any other type. To decode the Map type, a client will need a mapping between the key and the value type. The user will define this mapping in a file similar to those used with Protobuf. However, this unit test will create the mapping programmatically.

def test_deserialize_map():
    mapping = Mapping(
        MapEntrySpec(1, "name", String),
        MapEntrySpec(2, "age", UnsignedInt),
        MapEntrySpec(3, "likes_chocolate", Boolean)
    )

    bytestream = bytes([
        3,  # Total number of entries.
        1, *String.to_bytes("Bede Kelly"),
        2, *UnsignedInt.to_bytes(20),
        3, *Boolean.to_bytes(True)
    ])

    assert {
        "name": "Bede Kelly",
        "age": 20,
        "likes_chocolate": True
    } == Map(mapping).read(bytestream)

Here, we're doing three things. Firstly, we create a Mapping. This defines the relationship between numeric keys (1, 2, and 3), key names ("name", "age" and "likes_chocolate"), and the types of values these keys will be associated with.

Secondly, we create a bytestream which matches the structure we'd like our Map to have. It starts off with the number of entries in the Map, then alternates between numeric keys and their values. The first numeric key is 1, and its value is the encoded string "Bede Kelly". I'm using the previously-tested to_bytes method for each of the inner types, as I think it makes the code clearer and reduces redundancy in the testing.

Thirdly, we create a Map type from the mapping, and try reading a value of that type from the bytestream. If it's equal to our ideal dictionary of data, our test passes!

Let's see how this works with the inverse: generating a stream of bytes from data according to a mapping.

def test_serialize_map():
    """
    Given a mapping and a dictionary of data, serialize
    a Map into a stream of bytes.
    """
    mapping = Mapping(
        MapEntrySpec(2, "colour", String),
        MapEntrySpec(1, "manufacturer", String),
        MapEntrySpec(3, "preowned", Boolean),
        MapEntrySpec(4, "miles_travelled", UnsignedInt)
    )

    car_data = {
        "preowned": True,
        "manufacturer": "Ford",
        "colour": "brown",
        "miles_travelled": 18562
    }

    assert bytes([
        4,  # Number of entries
        3, *Boolean.to_bytes(True),
        1, *String.to_bytes("Ford"),
        2, *String.to_bytes("brown"),
        4, *UnsignedInt.to_bytes(18562),
    ]) == bytes(
        Map(mapping).to_bytes(car_data)
    )

Again, we're creating a Mapping to link the numeric keys, the string names and the value types. But this time, we're checking that our serialization works correctly: that the to_bytes method of our new Map type outputs the correct sequence of bytes.

Finally, I'd like these Map types to complain when they don't get given all the values they need to satisfy their internal Mapping.

def test_serialize_map_fails_with_missing_values():
    """
    Make sure that if values are missing from the input,
    the Map.to_bytes method complains.
    """
    mapping = Mapping(
        MapEntrySpec(2, "colour", String)
    )

    with pytest.raises(ValueError):
        bytes(Map(mapping).to_bytes({}))

The implementation of this is rather long to post in full, but the current version is saved as a Gist.
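That said, a condensed sketch of the shape it takes might look like this. The class and attribute names mirror the tests above, but the internals (and the inline UnsignedInt, repeated so the snippet stands alone) are my reconstruction rather than the real code:

```python
from collections import namedtuple

MapEntrySpec = namedtuple("MapEntrySpec", ["key", "name", "type"])


class UnsignedInt:
    """Minimal LEB128 codec, as defined earlier."""

    @staticmethod
    def to_bytes(n):
        while True:
            byte = n & 0b0111_1111
            n >>= 7
            if n:
                yield byte | 0b1000_0000
            else:
                yield byte
                return

    @staticmethod
    def read(bytestream):
        number = offset = 0
        for byte in bytestream:
            number |= (byte & 0b0111_1111) << offset
            offset += 7
            if not byte & 0b1000_0000:
                break
        return number


class Mapping:
    """Index the entry specs by numeric key and by name."""

    def __init__(self, *specs):
        self.by_key = {spec.key: spec for spec in specs}
        self.by_name = {spec.name: spec for spec in specs}


class Map:
    def __init__(self, mapping):
        self.mapping = mapping

    def read(self, bytestream):
        # Wrap in iter() so every read consumes the same stream.
        stream = iter(bytestream)
        count = UnsignedInt.read(stream)
        result = {}
        for _ in range(count):
            key = UnsignedInt.read(stream)
            spec = self.mapping.by_key[key]
            result[spec.name] = spec.type.read(stream)
        return result

    def to_bytes(self, data):
        missing = set(self.mapping.by_name) - set(data)
        if missing:
            raise ValueError("Missing values: {}".format(missing))
        yield from UnsignedInt.to_bytes(len(data))
        for name, value in data.items():
            spec = self.mapping.by_name[name]
            yield from UnsignedInt.to_bytes(spec.key)
            yield from spec.type.to_bytes(value)
```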


Here's my current to-do list for this serialization framework:

  • Create a List type
  • Create an Option type
  • Create a SignedInt type
  • Create a Double type
  • Read user-definition files into Mappings
  • Test nested Map types

Edit: as of Monday 19th March, most of these to-do items are done, except creating a Double type. The current code is available on my GitHub.