Semantic data storage

20/04/2019

Tags: data, convention, memory, storage

One of the things that has always been mysterious to me is how computers make sense of data stored arbitrarily in a memory device. How does the computer know what the 1s and the 0s mean? This isn’t a complicated topic at all, but it’s one that no one ever really bothers to explain.

Binary data in memory

Let’s take a look at a few hardware devices where binary data can be stored.

The magnetic core

Magnetic-core memory is made of small ferrite rings threaded onto a grid of wires, which makes it look a bit like chainmail. Current pulses sent through the wires magnetize each ring in one of two directions, and that direction signifies a binary 0 or 1. This is where the term core dump comes from (“dumping the magnetic core”).

The magnetic tape

Data can also be stored on magnetic tape. This is an interesting example, because it closely mirrors an abstract model, the Turing machine, in which symbols are likewise stored on a tape.

The hard disk drive (HDD)

An HDD typically has several disk-shaped magnetic platters, stacked vertically on a common spindle. Each platter has concentric circular tracks, which in turn are divided into sectors. In a sense, although a track is circular, data on it is still stored in a linear fashion.

As we can see, data is often stored one-dimensionally, which is to say a single value is enough to address it (an X coordinate, for instance). In the case of an HDD, you have to know the specific platter, the track, and the sector on the track, so three values. No matter how the data is laid out spatially, the question remains: how do you interpret it?
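The two addressing styles are not as far apart as they look. As a minimal sketch, here is the classic cylinder-head-sector (CHS) to linear block address conversion, where a "head" picks a platter surface and a "cylinder" picks a track position. The geometry values (16 heads, 63 sectors per track) are example numbers I picked, not those of any particular drive.

```python
# Example geometry (assumed values, chosen only for illustration).
HEADS_PER_CYLINDER = 16
SECTORS_PER_TRACK = 63


def chs_to_lba(cylinder, head, sector,
               heads=HEADS_PER_CYLINDER, sectors=SECTORS_PER_TRACK):
    """Flatten a three-value CHS address into a single linear address.

    By convention, sectors are numbered from 1, cylinders and heads from 0.
    """
    if not 1 <= sector <= sectors:
        raise ValueError("sector out of range")
    if not 0 <= head < heads:
        raise ValueError("head out of range")
    return (cylinder * heads + head) * sectors + (sector - 1)


def lba_to_chs(lba, heads=HEADS_PER_CYLINDER, sectors=SECTORS_PER_TRACK):
    """Recover the three-value address from the linear one."""
    cylinder, rest = divmod(lba, heads * sectors)
    head, sector_index = divmod(rest, sectors)
    return cylinder, head, sector_index + 1


address = chs_to_lba(2, 3, 4)      # 2208 with the geometry above
triple = lba_to_chs(address)       # (2, 3, 4)
```

Note that even this conversion only works because both sides agree on the geometry and on where the numbering starts.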

Convention

Convention is always involved, although to varying degrees. The word itself comes from the Latin convenire (“to come together, to agree”), and it means a formal agreement, or covenant. Indeed, the principle of communication between two parties is that both assume a format, with an associated set of rules for interpreting it. When people talk, this is fairly implicit, but for computer storage, we create this convention more or less from the ground up.

Interpreting serial data

We will restrict the conversation to a string of 1s and 0s, as data is almost always serialized with our current technology. Usually, such a string can encode multiple concepts, but let’s start with a string that encodes a possible value for a single concept. Here, convention comes into play, as we have to agree on the concept: does that string represent a type of flower? The name of an animal species?
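To see how much the convention matters, here is a small sketch that reads the very same four bytes under four different agreements. The byte values are arbitrary ones I chose because they happen to spell a word in ASCII.

```python
import struct

# The same 32 bits throughout; only the agreed-upon interpretation changes.
raw = bytes([0x42, 0x4C, 0x55, 0x45])

as_text = raw.decode("ascii")                  # four ASCII characters: "BLUE"
as_uint_be = int.from_bytes(raw, "big")        # unsigned integer, big-endian
as_uint_le = int.from_bytes(raw, "little")     # unsigned integer, little-endian
(as_float,) = struct.unpack(">f", raw)         # IEEE 754 single, a bit above 51

print(as_text, hex(as_uint_be), hex(as_uint_le), as_float)
```

Nothing in the bits themselves says which of these readings is the right one. That information lives entirely in the convention shared by whoever wrote the data and whoever reads it.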

When we move on to multiple concepts stored serially, the following problem arises: where does the data encoding one concept end and the data encoding the next begin?

If, by convention, fields have a fixed length, then nothing more is needed than knowledge of that convention. But sometimes data must have variable length. What if you want to store a book? Some books have 200 pages, some 1000. Do you pick a ludicrously large fixed size for the field, and waste a massive amount of memory on every short book? This is a problem.
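Here is a minimal sketch of the fixed-length case, using Python's struct module. The record layout (a 16-byte title followed by a 2-byte page count) is a hypothetical convention made up for this example.

```python
import struct

# Hypothetical layout, agreed on by writer and reader:
#   16 bytes  title, ASCII, padded with NUL bytes (longer titles are cut off)
#    2 bytes  page count, unsigned, big-endian
RECORD = struct.Struct(">16sH")


def pack_book(title, pages):
    """Encode one record; struct pads or truncates the title to 16 bytes."""
    return RECORD.pack(title.encode("ascii"), pages)


def unpack_books(blob):
    """Split a blob into records purely by counting bytes."""
    books = []
    for raw_title, pages in RECORD.iter_unpack(blob):
        books.append((raw_title.rstrip(b"\x00").decode("ascii"), pages))
    return books


blob = pack_book("Short book", 200) + pack_book("Long book", 1000)
books = unpack_books(blob)   # [("Short book", 200), ("Long book", 1000)]
```

The reader never has to search for boundaries, since every record is exactly RECORD.size bytes long. The price is visible in the title field: short titles are padded and long ones are truncated, which is exactly the waste-or-lose trade-off described above.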

As it turns out, two very simple strategies are often used to take care of this.

Signatures

Do not confuse these with authenticity-related signatures! Signatures are pre-defined values (guess how they’re established…) which mark the beginning of the next field after a variable-length field. You can encounter these signature values in the ubiquitous ZIP file format, where they mark the sections of the file.
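As a rough illustration, the sketch below builds a small ZIP archive in memory and scans it for the 4-byte signature that opens each file entry in the ZIP format. The file names and contents are placeholders, and find_signature is a helper written for this example, not part of any library.

```python
import io
import zipfile

# Signature that opens each file entry ("local file header") in a ZIP file.
LOCAL_FILE_HEADER = b"PK\x03\x04"


def find_signature(data, signature):
    """Return every offset at which the signature bytes occur."""
    offsets = []
    pos = data.find(signature)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(signature, pos + 1)
    return offsets


# Build a tiny archive in memory (placeholder names and contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
    archive.writestr("a.txt", "first entry")
    archive.writestr("b.txt", "second entry, a bit longer")
data = buffer.getvalue()

offsets = find_signature(data, LOCAL_FILE_HEADER)

# For comparison: the entry offsets that the archive itself records.
with zipfile.ZipFile(io.BytesIO(data)) as archive:
    recorded = [info.header_offset for info in archive.infolist()]
```

A naive scan like this has a weakness: if the stored data happens to contain the signature bytes, the scan reports a boundary that isn't one. This is one reason a signature is often paired with other information, such as recorded offsets or lengths.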

Length fields

A simple alternative is to use a length field, which states the length of the associated variable-length field. This strategy is used in another ubiquitous technology, the Internet protocol stack: the IPv4 header, for instance, carries a Total Length field giving the size of the whole packet in bytes.
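A minimal sketch of the idea: each field is preceded by its own length, so the reader always knows how many bytes to take next. The 4-byte big-endian prefix is a choice made for this example, not something taken from a particular format.

```python
import struct

# Assumed convention: every field is preceded by a 4-byte unsigned
# big-endian integer giving the field's length in bytes.
LENGTH = struct.Struct(">I")


def encode_fields(fields):
    """Concatenate fields, each one prefixed with its length."""
    out = bytearray()
    for field in fields:
        out += LENGTH.pack(len(field))
        out += field
    return bytes(out)


def decode_fields(blob):
    """Walk the blob, reading a length and then that many bytes."""
    fields = []
    pos = 0
    while pos < len(blob):
        if pos + LENGTH.size > len(blob):
            raise ValueError("truncated length prefix")
        (size,) = LENGTH.unpack_from(blob, pos)
        pos += LENGTH.size
        if pos + size > len(blob):
            raise ValueError("field runs past the end of the data")
        fields.append(blob[pos:pos + size])
        pos += size
    return fields


books = [b"a short book", b"x" * 1000, b""]
encoded = encode_fields(books)
decoded = decode_fields(encoded)   # equal to books
```

A short book now costs only four extra bytes rather than the padding of a worst-case fixed size, and the data itself may contain any byte values, since nothing is reserved as a marker.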


Of course, these aren’t the only strategies. More complicated ones exist (look at parsers for programming languages, which handle exactly this problem), but these two are very often used at a low level, close to the hardware, because of their simplicity.

Copyright © 2020 David Cian