Web Analytics Made Easy - Statcounter
BaseKV
Sign InSign Up
Back to Articles

BaseKV Internals: SCAN Cursors over a Live B+tree

How BaseKV implements Redis SCAN on a BoltDB B+tree: an opaque random cursor backing a server-side session, a multi-phase walk across type buckets, and per-call snapshots.

BaseKV Team11 min read
scanboltdbb-treerediscursorsconsistencyinternals

This is part 4 of our series on the internals of BaseKV, a disk-based, Redis-compatible key-value engine written in Go on top of a single BoltDB file. In part 1 we laid out the design bet, in part 2 we showed how each Redis type is encoded into the B+tree, and in part 3 we walked through sorted sets on disk. You are reading part 4. Next, in part 5, we serve three protocols out of one file.

SCAN is the command that forces every design decision in the type encoding to come home to roost. A client calls SCAN 0, gets back a cursor and a handful of keys, calls SCAN <cursor> again, and keeps going until the cursor comes back as 0. The contract Redis offers is deliberately weak: every key that is present for the whole duration of the scan is returned at least once, keys that churn in and out during the scan may or may not appear, and a key is never returned in a corrupted half-state. No global lock, no blocking, no snapshot held open for the lifetime of the iteration. We had to reproduce that contract on a BoltDB B+tree where, as part 2 explained, a single logical Redis key is not one row but a scatter of physical keys spread across six type buckets. This is how we did it, and where the honesty about its limits lives.

The problem SCAN actually poses

A naive implementation of key iteration is KEYS *, and we have one. It opens a read transaction, walks every bucket with ForEach, strips each physical key back to its user-facing name, dedups into a map, and returns the lot. It is correct and it is a foot-gun: it holds a read view open while it materializes the entire keyspace into memory, and it blocks nothing else but allocates everything. Redis ships KEYS with the same warning. SCAN exists precisely so a client can page through the keyspace in bounded chunks without ever holding a long transaction or a full result set.

The difficulty specific to BaseKV is that "the keyspace" is not a single ordered list. Strings live in one bucket. Hash fields live in a hash bucket keyed by <key> <field>. Set and zset members live in their own buckets keyed by a length-prefixed composite. Streams live inline in the data bucket under a stream: prefix. Graphs live in a graph bucket under an m: meta prefix. To return each user key exactly once, we have to walk six different physical orderings, translate each physical key back to its logical name, and suppress the duplicates that the encoding inevitably produces (a hash named user:1 shows up under as many physical rows as it has fields). And we have to be able to stop in the middle of any of those buckets and resume there on the next call.

The session model

A Redis cursor is supposed to be an opaque token. The textbook Redis implementation encodes a reverse-binary position in the cursor itself, so the cursor is stateless and the server holds nothing between calls. We chose the other valid option: the cursor is a handle to server-side state, and the iteration position lives in a session. That session is the heart of this article.

type scanPhase int

const (
	scanPhaseStrings scanPhase = iota
	scanPhaseHash
	scanPhaseSet
	scanPhaseZSet
	scanPhaseStream
	scanPhaseGraph
	scanPhaseDone
)

type scanSession struct {
	mu         sync.Mutex
	phase      scanPhase
	lastKey    []byte // last visited key in the current phase
	seen       map[string]struct{}
	lastAccess time.Time
}

Four fields carry everything. phase records which type bucket we are currently draining; the iteration runs strings, then hash, then set, then zset, then stream, then graph, then scanPhaseDone. lastKey is the resume point: the raw physical key we last looked at inside the current bucket, so the next call knows where to seek. seen is a dedup set of user-facing key names already returned, so a name that exists in two buckets (a string foo and a set foo, say) is returned once across the whole scan. lastAccess is the timestamp the janitor uses to decide when to evict an abandoned session.

Sessions live in a single global map guarded by a mutex:

var (
	scanSessionsMu sync.Mutex
	scanSessions   = map[uint64]*scanSession{}
)

When a client calls SCAN 0, we mint a new session and hand back its id as the cursor:

var b [8]byte
_, _ = rand.Read(b[:])
id := binary.BigEndian.Uint64(b[:])
if id == 0 {
	id = 1
}

The id is a random 64-bit number from crypto/rand, not a serialized offset and not a counter. We force it away from 0 because 0 is reserved to mean "start a new scan" on the way in and "scan complete" on the way out. We loop on the rare collision and re-roll. The opacity is genuine: a client cannot decode position, byte offset, or progress from the cursor, because there is nothing encoded in it. It is a lookup key into a server-side map and nothing more.

The dispatch at the top of handleSCAN is correspondingly simple. Cursor 0 creates a session; any other cursor looks one up. If the lookup misses, which happens when a session has been evicted or the server restarted, we do not error. We return cursor 0 and an empty key list, which a Redis client reads as a completed scan. That is the most graceful failure available to us, and we will come back to why a miss can happen.

One snapshot per call, not per scan

Every call to handleSCAN opens its own BoltDB read transaction through db.View, which is a thin wrapper over bbolt's View:

func (db *DB) View(fn func(tx *Tx) error) error {
	return db.bdb.View(func(btx *bbolt.Tx) error {
		tx := &Tx{btx: btx, db: db}
		return fn(tx)
	})
}

Inside that callback we hold a consistent, point-in-time read snapshot of the whole database, courtesy of BoltDB's MVCC. Writers proceed against a newer version; our cursor walks a frozen tree and never sees a torn write. That is what gives us the "never returns a corrupted key" half of the Redis contract for free, within a single call.

The word "single" is the load-bearing one. The snapshot exists only for the duration of one SCAN call. The session state, by contrast, persists across calls. So a full scan is stitched together from many independent snapshots, one per page, with the session carrying phase and lastKey across the gaps between them. We deliberately do not hold a read transaction open for the lifetime of a scan, because a client that fetches one page and wanders off would otherwise pin an old database version and block compaction indefinitely. This split, durable session plus ephemeral snapshot, is exactly the trade-off Redis makes, and it is the source of the consistency caveat at the end.

Walking a bucket and advancing the phase

The body of a single call is a loop that keeps pulling keys until it has collected COUNT of them (default 10) or the phase reaches scanPhaseDone. Each phase opens the relevant bucket, takes a cursor, and seeks to just past lastKey:

func cursorSeekAfter(c *bbolt.Cursor, lastKey []byte) (k, v []byte) {
	if lastKey == nil {
		return c.First()
	}
	k, v = c.Seek(lastKey)
	if k != nil && bytes.Equal(k, lastKey) {
		k, v = c.Next()
	}
	return k, v
}

A nil lastKey means "start of bucket." Otherwise we seek to lastKey and, if the exact key is still present, step one past it. If the key we stopped on has since been deleted, Seek lands on the next key that exists, which is precisely where we want to resume. After each physical key we update lastKey with a copy, then translate that physical key to its logical name and feed it through a small collector:

addKey := func(userKey string) {
	if _, ok := sess.seen[userKey]; ok {
		return
	}
	if matched, _ := filepath.Match(pattern, userKey); !matched {
		return
	}
	sess.seen[userKey] = struct{}{}
	keys = append(keys, userKey)
}

The MATCH glob is applied here with filepath.Match, and the dedup against seen happens before we append. The translation differs per phase because the encoding differs per phase. In the strings bucket we skip the internal list:<key>:idx:... element rows (seeking past them in bulk with a prefix upper-bound) and surface only the list:<key>:meta row and the hll:<key> row as logical keys. In the hash bucket we split on the first space to recover the key name and then jump straight past the rest of that key's fields with a prefix seek, so a hash with ten thousand fields costs one logical emission, not ten thousand iterations. Set and zset rows are length-prefixed composites, so we read the 4-byte length, slice out the base key, and prefix-seek past the rest of the members. Streams and graphs each have their own prefix shape. When a bucket's cursor returns nil, the phase advances and lastKey resets to nil for the next bucket.

That prefix-skipping matters for COUNT to behave. Without it, a single fat hash would let SCAN burn its entire COUNT budget on one logical key, and the client would page forever. By seeking past each key's member range, the work per call stays proportional to the number of distinct logical keys we touch, not the number of physical rows underneath them.

When the loop finishes, the cursor returned to the client is the session id, unless the phase reached scanPhaseDone, in which case we delete the session and return 0:

nextCursor := sid
if done {
	deleteScanSession(sid)
	nextCursor = 0
}
sort.Strings(keys)

We sort the page before returning it. Redis does not promise ordered output and we do not promise it either, but a stable order inside a page is friendlier to the humans and tools reading the results, and it costs nothing meaningful.

The janitor

Because sessions are server-side state, an abandoned scan leaks memory. A client that calls SCAN 0, reads one page, and never comes back has left a scanSession (and its growing seen map) sitting in the global map forever. We sweep these with a janitor goroutine started lazily on the first scan:

t := time.NewTicker(30 * time.Second)
for range t.C {
	cutoff := time.Now().Add(-2 * time.Minute)
	scanSessionsMu.Lock()
	for id, s := range scanSessions {
		s.mu.Lock()
		la := s.lastAccess
		s.mu.Unlock()
		if la.Before(cutoff) {
			delete(scanSessions, id)
		}
	}
	scanSessionsMu.Unlock()
}

Every 30 seconds it walks the map and evicts any session whose lastAccess is older than two minutes. Two minutes is comfortably longer than any reasonable gap between consecutive SCAN calls in a paging loop, and short enough that a forgotten scan does not pin memory for long. If a client does resume an evicted session, the cursor lookup misses and, as above, it gets a clean 0 back rather than an error. The scan ends early from the client's point of view, which is a defensible outcome for a scan nobody was driving.

The honest consistency story

Here is where we tell you what this design does not give you, because that is more useful than pretending it gives you everything.

The snapshot is per call, not per scan. Between two SCAN calls the database moves on. A key that is created after we have already passed its position in a bucket, on a later snapshot, will not be seen by this scan. A key that is deleted after we returned it stays returned. A key that exists from the first call to the last is guaranteed to appear, because it sits at a stable position in every snapshot we open and our lastKey resume point will eventually reach it. This is exactly Redis SCAN's guarantee, stated plainly: full coverage for keys that persist across the whole scan, best-effort for keys that churn during it. We did not weaken Redis's contract and we did not strengthen it; we reproduced it, including its soft edges.

There is a subtler interaction with the seen set. Dedup is by logical name, so if a key foo is a string when we pass the strings bucket and is then deleted and recreated as a set before we reach the set bucket, seen already holds foo and the set incarnation is suppressed. The scan returns foo once, which is the right count, even though the value it conceptually refers to changed underneath. We consider that acceptable: SCAN returns names, not values, and the name was live throughout.

Two more limits worth naming. Sessions are not durable across a restart. If the server restarts mid-scan, every session in the map is gone, the next call's cursor misses, and the client gets 0. The client must start over. And HSCAN, which scans the fields of one hash, does not use this session machinery at all; it uses a plain integer offset cursor computed per call against that one key's contiguous fieldspace, because a single hash lives as one ordered run in one bucket and does not need cross-bucket phase tracking or a seen set. The session model is the price of the keyspace scan spanning six encodings; the single-key scans do not pay it.

That is SCAN on a live B+tree: an opaque random cursor backing a server-side session, a six-phase walk that translates each physical encoding back to a logical name and skips past member runs in bulk, a per-call BoltDB snapshot for tear-free reads, and a janitor to clean up after clients that wander off. The guarantees are Redis's guarantees, soft edges and all, which is exactly what a Redis-compatible engine should offer.


This was part 4 of the BaseKV internals series. Continue to part 5: three protocols, one file, go back to part 3: sorted sets on disk, or return to the series hub in part 1.