Nate Lincoln's Blog

The Simplest Possible Database

October 19, 2019

Introduction

Recently, I’ve spent a lot of time thinking about different designs for databases. I’ve decided to repurpose my relatable database as something that is far lower-level than a SQL interpreter, which was the original goal. Instead, it should be a small primitive that abstractions can be built on top of.

That begs the question though: What should that abstraction look like? What is the essence of a database? That is what I want to answer today.

What is a database?

I want to define a database as something that allows us to store records inside of tables, all inside of some kind of database, if I may be so meta.

So right there, we’ve identified the base set of primitives within our system: records, tables, and some database to manage the two.

The next step in the process is to identify what the ideal API for the database would be. The goal here isn’t necessarily to have support for everything right out of the box, that can be built up later. Instead, the goal here is to identify something flexible enough to support as many use cases as possible, while still being pleasant to work with.

What is a record?

I think it’s worth stepping back to discuss what a record is. I think most of the time, the mental image is something like the following:

struct User {
  id: u64,
  name: String,
  // other fields here
}

So we have a set of identifiers, and a piece of data associated with each identifier. But how do we represent this on disk?

Maybe you want to be space-efficient, and represent it with the following:

|-----------------------|
| id (8 bytes)          |
|-----------------------|
| name length (8 bytes) |
|-----------------------|
| name (unknown bytes)  |
|-----------------------|

But you also might want to write it out in a way that’s more self documenting, e.g. by attaching the names of the fields to the data within them:

{
  "id": 1,
  "name": "nathanielle"
}

This is more commonly referred to as JSON 😁.

Do we really care about the difference between these two formats? Honestly, who are we to decide how the developer lays out their record? In my opinion, we shouldn’t even be prescribing to the user that there is some schema that they should apply. That should be left up to some other abstraction to manage.

There’s a solution to all of this, and it’s so simple it’s practically cheating. If we define a record as an array of bytes, then it doesn’t matter how the user decides to represent their structure, because the only thing the database sees is a set of bytes.

Tables

So what is a table then? Usually, we think of something like:

CREATE TABLE users (
  id int primary key,
  name text
);

But we’ve already established that we don’t need things like a schema. Does that mean we can reduce it down to just:

create table users;

Yes, but do we really need to name our tables? Naming things is hard, after all. Plus then we have to go through the effort of comparing string values to each other, or hashing them, or something. It just isn’t it worth it.

Instead, tables will just be numbers. This does come with the downside that some kind of mapping must be maintained between table numbers and the kind of data represented within them, but at this low of level it’s no big deal. To get around this, a new table could be created that holds the names for all the other tables.

A note on indexing

One of my main worries about all of this is whether we’re painting ourselves into a corner here. One of the hallmarks of any well-made database is it’s ability to index certain attributes, in order to make queries fast. So far, our database has been designed without indexes in mind, which means we should at least take a step back and think about where we are.

At their core, indexes are mappings from values to records in the database. Each value could point to one or more records. I think there is a way to make this definition work with our design. First of all, we will need to enforce that each record in each table in our database has a unique primary key. Then, we can store records using the same format that a typical primary key index would: using a b-tree index. Indexes on other fields can also be represented as tables, with the key of the index being the primary key of the index table. I think it might not be the most efficient method, but for our goals of providing something simple and flexible, it just might work.

Conclusion

We’ve explored some design decisions that are important when designing databases. Along the way, we were able to cut away many of the aspects of databases that are theoretically unneeded, and only serve to complicate the design. We’ve arrived on some design that feels very simple, and hopefully will serve us well in the future. Among those design decisions are: forgoing any kind of schema, using numbers instead of strings for table names, and requiring primary keys for records in tables. I’m very interested to try implementing a database with these decisions, and to see how it feels to use.


Nathan Lincoln

Written by Nathan Lincoln. I study Computer Science at Missouri University of Science & Technology. I graduate in December. View my Resume!