Web Analytics Made Easy - Statcounter
BaseKV
Sign InSign Up
Back to Articles
Featured

The Bet on a Disk-Based Redis: Why BaseKV Puts Data on Disk

Part 1 of the BaseKV internals series: why a disk-based Redis on BoltDB, the one-file two-bucket layout, and the read-fan-out, single-writer concurrency model.

BaseKV Team10 min read
basekv-internalsboltdbarchitectureredisdisk-basedconcurrency

Most of the time, the cost of an in-memory data store is not the database, it is the RAM. Redis is fast because it keeps everything resident in memory, and that is exactly why it gets expensive: every byte you store is a byte of RAM you rent, whether or not anyone is reading it this second. For a working set that is genuinely hot and small, that trade is worth it. But a large share of real key-value workloads are neither. The data set is bigger than memory, most of it is cold, and the application would happily accept a few milliseconds of latency in exchange for not paying to keep a hundred gigabytes of barely-touched keys in RAM. BaseKV is a bet that this case is common enough to build a database around.

This is the first article in a seven-part series taking apart the internals of BaseKV, a single-binary, disk-based key-value engine written in Go that speaks the Redis, Memcached, and a subset of the DynamoDB HTTP protocols against one file on disk. We are writing this series because BaseKV has near-zero traction and we think the most honest way to earn a developer's attention is to show the design, including the places where it gives something up. This part covers the core thesis and the top-level shape. The rest of the series goes deep on the parts that are genuinely interesting to build.

Here is the full map, and we link each piece as we go:

  1. The bet on a disk-based Redis (this one)
  2. Fitting Redis data types into a B+tree
  3. Sorted sets on disk, an honest trade-off
  4. SCAN cursors over a live B+tree
  5. Three protocols, one file
  6. Durability, the write loop, and TTL
  7. A RedisGraph subset on a KV engine

The thesis: put the data on disk and let the page cache be the hot tier

The central design choice is to store the data on disk and treat the operating system's page cache as the hot tier, rather than reserving RAM ourselves. The reasoning runs like this. In an in-memory store, you size your instance by total data set. In a disk-based store, you size it by working set, because only the pages you actually touch need to be resident, and the kernel keeps those in the page cache for you. A read of a recently-touched key is served from cached pages and is fast. A read of a cold key faults a page in from disk and is slower. The data set as a whole can be far larger than RAM, bounded by disk rather than by memory, and you stop paying to keep cold data warm.

That is the upside, and we want to be precise about the downside in the same breath. A disk-based store is generally slower than an in-memory one, especially for cold reads and for write-heavy workloads, which we get to below. The pitch is not "as fast as Redis for less money." It is "fast enough for latency-tolerant workloads whose data set has outgrown what you want to pay to keep in RAM." If your working set fits comfortably in memory and you need every microsecond, in-memory Redis is the right tool and we will not pretend otherwise. We have written more about where this line falls in Disk-Backed Redis, Explained and Key-Value vs Redis.

Why BoltDB as the foundation

A design like this lives or dies on its storage engine, and we did not write our own. BaseKV is built on BoltDB (the maintained bbolt fork), and the reasons are specific.

BoltDB is a single file. The entire database is one basekv.db file, which makes backups, container volumes, and "copy it somewhere else" operations trivial. There is no WAL directory, no segment files to reconcile, no separate index. This matters more than it sounds, because operational simplicity is a feature we are deliberately optimizing for.

BoltDB is an mmap'd B+tree. It maps the file into the process address space and reads keys directly out of those mapped pages, which is exactly the page-cache-as-hot-tier model the thesis depends on. We do not maintain our own block cache and try to outsmart the kernel; we let the kernel do what it is good at. The README is blunt about the consequence: BoltDB has no fixed memory limit you can tune, and less RAM simply means more page faults and more disk IO. That is the trade, stated plainly.

BoltDB gives us real ACID transactions, with multiple concurrent read transactions and a single read-write transaction at a time, all serializable. We get a View() for reads and an Update() for writes, and we get crash safety: a committed write survives a process crash or power loss. We build the entire Redis command surface on top of those two primitives. The B+tree also hands us sorted key iteration for free, which is the lever that makes SCAN and sorted sets implementable at all. Parts 3 and 4 of this series are entirely about exploiting that ordering.

The cost of choosing BoltDB is its single-writer model, and we will come back to it, because it shapes everything about how BaseKV handles writes.

The top-level shape: one file, two roots, eight sub-buckets

Open basekv.db and the layout is small enough to hold in your head. At the top level there are two root buckets. A data bucket holds all the user data. An expiry bucket holds TTL metadata, keyed so that the expiry record for mykey lives under expiry:mykey. Keeping expiry in its own bucket lets a background sweeper scan only TTL records without walking the entire keyspace, which Part 6 covers in detail.

Inside data, we do not throw every key into one flat namespace. We split by Redis type into eight sub-buckets, created once at startup:

stringsBucketName = []byte("strings")
listsBucketName   = []byte("lists")
setsBucketName    = []byte("sets")
hashesBucketName  = []byte("hashes")
zsetsBucketName   = []byte("zsets")
streamsBucketName = []byte("streams")
hllBucketName     = []byte("hll")
graphsBucketName  = []byte("graphs")

Every command resolves to one of these by data type before it touches a key. A helper takes a type string and returns the matching sub-bucket out of the open transaction, so a GET reaches into strings, an LPUSH into lists, a ZADD into zsets, and so on. This is not cosmetic. Isolating types into separate B+trees keeps each type's keys contiguous, makes type-scoped iteration cheap, and avoids one type's key encoding colliding with another's. How each type actually encodes its values and members into these buckets is the subject of Part 2, and the sorted-set case is interesting enough to get its own article in Part 3.

The concurrency model: reads fan out, writes funnel through one goroutine

This is the part of the design we find most worth explaining, because it is where the engine's character comes from.

Reads are easy. BoltDB allows many concurrent read transactions, so BaseKV serves reads with a goroutine per connection, each opening its own View(). A read does not block other reads, and it does not block the writer either, because BoltDB's MVCC gives every read transaction a consistent snapshot. Read concurrency scales with your cores and your IO.

Writes are the opposite, by necessity. BoltDB permits exactly one read-write transaction at a time. Rather than let connection goroutines contend on a mutex around Update(), BaseKV serializes all writes through a single dedicated goroutine that pulls write requests off a channel. The struct carries a writeCh chan *writeRequest, and at startup one goroutine runs the loop that drains it:

func (db *DB) runWriteLoop() {
    for req := range db.writeCh {
        var err error
        if req.batch {
            err = db.bdb.Batch(func(btx *bbolt.Tx) error { ... })
        } else {
            err = db.bdb.Update(func(btx *bbolt.Tx) error { ... })
        }
        req.done <- err
    }
}

Every write path funnels into this. A connection goroutine that wants to write builds a writeRequest holding the transaction function and a done channel, pushes it onto writeCh, and blocks waiting for the result. The write loop executes requests strictly one at a time and signals completion back through done. In effect, the single-consumer channel is the lock. There is no explicit mutex around the BoltDB write transaction because there does not need to be one: only runWriteLoop ever calls Update, so the serialization is structural rather than lock-based.

There are two refinements worth noting now and unpacking later. First, requests can ask for Batch instead of Update, which lets BoltDB coalesce several small concurrent writes into one fsync, tuned here with a short batch delay and a batch size cap. Second, a Redis MULTI/EXEC block needs every queued command to run inside one transaction, so during an EXEC the engine marks an active transaction owned by the executing goroutine and routes that goroutine's nested writes into the already-open transaction instead of re-queuing them onto the channel, which would deadlock. Part 6 walks the write loop and these refinements end to end.

The honest central trade-off

Now the trade we promised to be straight about. Because every write goes through one goroutine and, by default, every write transaction calls fsync before it returns, write throughput is bounded by fsync latency. You cannot commit faster than your disk can durably flush, and you cannot parallelize your way around it, because there is exactly one writer. Batching helps by amortizing one fsync across several queued writes, and a fast NVMe disk helps a lot, but the ceiling is real. A write-heavy workload on BaseKV will bottleneck on durable write throughput, full stop. This is inherent to the bet: we chose crash-safe single-file durability and we pay for it on the write path.

There is an escape hatch, and it is exactly as dangerous as it sounds. Starting with --no-sync disables fsync, so writes return as soon as they hit the page cache and throughput jumps dramatically. The cost is that a crash or power loss can lose or reorder recent writes. We expose it because for some workloads, a rebuildable cache, an import job, a load test, losing the last few seconds of writes on a crash is acceptable and the speed is worth it. For anything where a write must survive, leave fsync on and size the workload accordingly. We say this in the startup logs too: enabling no-sync prints a warning that data loss may occur on crash. There is no async or periodic-fsync middle mode today; it is durable-and-bounded or fast-and-lossy.

Reads, to close the loop, do not pay any of this. They scale out across goroutines, they never touch the write loop, and a warm page cache serves them quickly. The asymmetry is the whole point. BaseKV is shaped for workloads that read far more than they write, that tolerate millisecond-class latency, and that have more data than they want to keep in RAM. That is a real and common shape, and it is the one we built for.

What the rest of the series covers

With the foundation in place, the interesting engineering is in how we made a rich Redis surface, roughly 85 percent of the command set, fit on top of an ordered KV B+tree without a custom storage engine. Part 2 shows how each Redis type encodes into its bucket. Part 3 is the honest accounting of sorted sets on disk, where Redis's dual score-and-member ordering is hardest to reproduce. Part 4 explains stateful SCAN cursors over a tree that can change between calls. Part 5 covers serving three wire protocols from the one file. Part 6 is the deep dive on durability, the write loop, and TTL expiry. Part 7 takes apart the RedisGraph-compatible subset we built on plain KV primitives.

If you want the conceptual background before the implementation details, Persistent Key-Value Storage and Disk-Backed Redis, Explained set the stage, and Key-Value vs Redis frames the trade-off this whole series is built around.


Next in the series: Fitting Redis data types into a B+tree. Related reading: Persistent Key-Value Storage, Disk-Backed Redis, Explained, Key-Value vs Redis.