Web Analytics Made Easy - Statcounter
BaseKV
Sign InSign Up
Back to Articles

BaseKV Internals: Durability, the Write Loop, and TTL

How BaseKV makes writes durable through one single-writer goroutine, runs MULTI/EXEC atomically, and expires keys with lazy reads, a heap, and a background sweeper.

BaseKV Team11 min read
basekvdurabilitywrite-loopttltransactionsboltdb

A key-value store earns trust on two questions that have nothing to do with how fast it serves a GET. The first is: when a write returns success, is it actually safe, or did we just hand the bytes to the operating system and hope? The second is: when a key is supposed to expire, what makes it actually disappear, and how much work does that cost? Both questions land in the same corner of the engine, because in BaseKV every mutation passes through one goroutine and the durability guarantee, the transaction semantics of MULTI/EXEC, and the machinery of TTL all hang off that single write path. This part takes that path apart.

You are reading part 6 of our seven-part series on the internals of BaseKV, a single-binary disk-based key-value engine written in Go that speaks Redis, Memcached, and a subset of the DynamoDB HTTP protocols against one file on disk. The full map:

  1. The bet on a disk-based Redis
  2. Fitting Redis data types into a B+tree
  3. Sorted sets on disk, an honest trade-off
  4. SCAN cursors over a live B+tree
  5. Three protocols, one file
  6. Durability, the write loop, and TTL (this one)
  7. A RedisGraph subset on a KV engine

One goroutine owns every write

BoltDB, the storage engine underneath BaseKV, allows many concurrent read transactions but exactly one read-write transaction at a time. We could have wrapped a mutex around Update() and let connection goroutines fight over it. Instead we made the serialization structural: every mutation is sent as a request on a channel, and a single goroutine drains that channel and is the only code in the process that ever opens a write transaction.

The request is a tiny struct carrying the transaction function, a reply channel, and a flag for which commit mode to use:

type writeRequest struct {
    fn    func(tx *Tx) error
    done  chan error
    batch bool
}

At startup we launch one runWriteLoop goroutine that ranges over writeCh forever. Each request runs either through BoltDB's Batch or its Update, and the result is sent back on the request's own done channel:

func (db *DB) runWriteLoop() {
    for req := range db.writeCh {
        var err error
        if req.batch {
            err = db.bdb.Batch(func(btx *bbolt.Tx) error {
                tx := &Tx{btx: btx, db: db}
                return req.fn(tx)
            })
        } else {
            err = db.bdb.Update(func(btx *bbolt.Tx) error {
                tx := &Tx{btx: btx, db: db}
                return req.fn(tx)
            })
        }
        req.done <- err
    }
}

A connection goroutine that wants to write does not call BoltDB at all. It builds a writeRequest, pushes it onto the channel, and blocks reading done. The public Update method is the front door:

req := &writeRequest{fn: fn, done: make(chan error, 1), batch: false}
db.writeMu.Lock()
db.writeCh <- req
db.writeMu.Unlock()
return <-req.done

The single-consumer channel is the lock. There is no explicit per-key locking anywhere in the engine, and no mutex guarding the BoltDB write transaction, because only one goroutine ever reaches it. Two clients writing the same key cannot interleave; their requests are simply processed one after another in channel order. The writeMu you see is not protecting the database, it is only protecting the channel itself from a send-after-close race during shutdown. The serialization that matters is the goroutine, not the mutex.

This is the trade we have been honest about since part 1. Writes do not parallelize. The ceiling on write throughput is one writer deep, and as we will see, by default that one writer pays an fsync per commit. Reads, by contrast, never touch any of this; they open their own View() and fan out across cores.

Batch versus Update, and what success actually promises

The batch flag is the first lever. Update opens a transaction, runs your function, and commits immediately, which on the default settings means one fsync before it returns. Batch is BoltDB's coalescing path: when several callers ask to batch at nearly the same moment, BoltDB groups them into a single underlying transaction and pays one fsync for the whole group. We tune that grouping with two constants set when the database opens:

bdb.MaxBatchDelay = 2 * time.Millisecond
bdb.MaxBatchSize = 1000

MaxBatchDelay is how long BoltDB will wait to gather more writes before committing the batch, and MaxBatchSize caps how many it will fold into one transaction. Two milliseconds is short enough not to add meaningful latency and long enough that, under concurrent load, a handful or a thousand small writes can share a single durable flush. This is the only knob that meaningfully raises write throughput while keeping fsync on, and it only helps when there are genuinely concurrent writers to coalesce; a single serial writer gets no benefit from batching and may as well use Update.

Now the durability contract, stated plainly. With sync on, which is the default, a success reply means the fsync completed and the write is on stable storage. If the process crashes or the machine loses power one instant after you saw success, the write is still there when BaseKV restarts, because BoltDB's commit only acknowledges after the data and metadata are durable. That is the whole promise, and it is why writes are bounded by how fast your disk can flush.

The escape hatch is --no-sync, which sets BoltDB's NoSync option globally. With it, a write returns as soon as the bytes reach the page cache, with no fsync, so throughput jumps sharply. The cost is that a crash or power loss can lose or reorder recent writes. The startup path logs a warning when no-sync is enabled precisely because the durability contract is now different: success means "accepted," not "durable." There is no async or periodic-fsync middle mode. It is durable-and-bounded, or fast-and-lossy, and you pick one at launch. We cover where each choice fits in Redis Persistence Configuration and Disk-Backed Redis, Explained.

MULTI, EXEC, and the honest scope of transactions

Redis transactions queue commands and then apply them together. BaseKV implements this per connection. MULTI sets an inTx flag on the connection's state and clears its command buffer. While that flag is set, any normal command is not executed; it is copied into the connection's queue and the client gets back QUEUED:

if connState.InTx() {
    connState.QueueCommand(cmd)
    conn.WriteString("QUEUED")
    return
}

DISCARD throws the queue away and clears the flag. EXEC is where the queued commands finally run, and the key property is that they all run inside one BoltDB write transaction. The handler pulls the queued commands, opens a single Update, and dispatches each one through the normal command handlers from within that one transaction:

err := s.db.Update(func(tx *Tx) error {
    s.db.beginExec(tx)
    defer s.db.endExec()
    for _, qCmd := range queuedCmds {
        handler, ok := GetCommandHandler(qCmdName)
        res, err := handler(s.db, qCmd)
        results = append(results, res /* or err */)
    }
    return nil
})

There is a subtlety that the single-writer model forces. Each queued command's handler would normally call db.Update, which would try to send another request onto writeCh and block, deadlocking against the EXEC transaction that is already running on the write loop. So beginExec records that a transaction is active and which goroutine owns it. When a handler inside EXEC calls Update, the engine notices the calling goroutine already owns an open transaction and runs the function directly against that transaction instead of queuing a new request. That is what makes the nested commands share one atomic commit rather than each becoming its own write.

We want to be honest about the scope here. This gives you all-or-nothing at the storage layer: the whole EXEC commits as one BoltDB transaction, so a crash mid-EXEC leaves none of it applied. What it does not give you is the full Redis transaction feature set. There is no WATCH and therefore no optimistic-locking compare-and-set across the transaction. There is no Lua scripting. And there is no mid-transaction rollback of individual commands; a command that errors records its error in the results array and the rest still run, exactly as Redis does, because the atomicity is the single underlying transaction and nothing finer. If you need conditional logic across keys, you build it above BaseKV, not inside a MULTI block.

TTL: three mechanisms that cover each other's gaps

Expiry in BaseKV is not one feature, it is three that work together, because no single one of them is sufficient on its own. Expiry timestamps live in their own root expiry bucket, keyed so the record for mykey sits at expiry:mykey. Keeping them separate from the data means the expiry machinery can scan only TTL records without walking the entire keyspace.

The first mechanism is lazy expiry on read. Every Get looks up the expiry record before returning a value. If the stored timestamp is at or before now, the key is treated as absent and ErrKeyNotFound is returned, and if the read happens to be inside a writable transaction the stale data and its expiry record are deleted on the spot:

if errParse == nil && time.Now().Unix() >= expAtUnix {
    if tx.btx.Writable() {
        _ = stringsBkt.Delete(key)
        _ = expiryBkt.Delete(ek)
    }
    return nil, ErrKeyNotFound
}

Lazy expiry guarantees correctness: an expired key is never visible, even if nothing has gotten around to physically deleting it yet. But on its own it leaks. A key that expires and is never read again would sit on disk forever, occupying space. So we need something that actively reclaims.

The second mechanism is the active sweeper, a background goroutine launched at startup as sweepExpiredKeys(1*time.Second, 500). Roughly once a second it opens a write transaction, walks the expiry bucket with a cursor, and for each record whose timestamp has passed it deletes the underlying key's data and removes the expiry record, up to a cap of 500 per sweep so one tick cannot monopolize the write loop. The cap and the interval together bound how much expiry work competes with real traffic.

Scanning the whole expiry bucket every second is wasteful when most records are nowhere near due, so the third piece is a min-heap of expiry times, ordered soonest-first. A scheduler goroutine peeks the heap's earliest item, sleeps exactly until that timestamp instead of polling, then deletes what has come due. New TTLs push onto the heap and wake the scheduler through a small channel so it can recompute its sleep:

db.ttlIndex[string(keyCopy)] = expAt
heap.Push(&db.ttlHeap, expiryItem{key: keyCopy, expAt: expAt})

The heap is paired with a ttlIndex map so a stale heap entry, one whose key was overwritten with a new TTL or persisted, is recognized and skipped rather than wrongly deleting a key. On startup the heap is rebuilt by reading the expiry bucket once. The periodic sweeper and the heap-driven scheduler overlap deliberately: the scheduler handles the common case promptly, and the once-a-second sweep is a backstop that catches anything the heap missed, including records left by a previous run.

Sitting in front of all of this is a small TTL cache, about ten thousand entries mapping a key to its expiry timestamp, so hot keys do not force a re-read of the expiry bucket on every access. Its eviction is deliberately crude: when it fills, it is cleared wholesale rather than maintaining an LRU, which keeps the cache cheap and lock-light at the cost of an occasional cold start. It is an IOPS optimization, not a source of truth; the buckets on disk remain authoritative, and lazy expiry on the real read is what guarantees correctness if the cache is empty or wrong.

COMPACT: reclaiming what BoltDB will not give back

One more durability-adjacent operation deserves mention, because it surprises people. BoltDB does not return freed pages to the operating system. When you delete a million keys, the file does not shrink; those pages go on an internal freelist and get reused by future writes, but the file stays as large as its high-water mark. For a workload that churns through a lot of short-lived or expiring keys, the on-disk file can stay much bigger than the live data.

COMPACT is the answer. It opens a fresh temporary BoltDB file, copies every bucket and key across into it, swaps the compacted file into place, and reports what it reclaimed:

saved := originalSize - newSize
return map[string]interface{}{
    "original_size": originalSize,
    "new_size":      newSize,
    "saved_bytes":   saved,
}, nil

We are blunt about its cost. COMPACT is heavy and blocking: it reads and rewrites the entire database, so it is something you schedule during a quiet window, not something you run casually under load. It is the deliberate, occasional counterpart to the everyday write path, the way you claw back the space that the freelist model trades away in exchange for never needing to coordinate with the kernel about shrinking a mapped file.

The takeaway

The write loop is the spine of BaseKV's durability story. One goroutine owns every mutation, so serialization is structural and there are no per-key locks; Batch coalesces concurrent writes into one fsync within a 2ms, 1000-write window while Update commits immediately; and with sync on, a success reply means the data survived to disk, a guarantee you trade away knowingly with --no-sync. MULTI/EXEC rides the same path to give all-or-nothing commits, honestly scoped to a single transaction with no WATCH, Lua, or partial rollback. Expiry is lazy on read for correctness, active via a heap-driven scheduler and a one-second sweeper for reclamation, and cached to spare the IOPS, with COMPACT as the heavy lever for giving disk space back. None of it is magic, and that is the point: it is a small set of mechanisms you can hold in your head, each covering a gap the others leave.


Previous in the series: Three protocols, one file. Next: A RedisGraph subset on a KV engine. Start at the series hub. Related reading: Redis Persistence Configuration, Disk-Backed Redis, Explained.