Web Analytics Made Easy - Statcounter
BaseKV
Sign InSign Up
Back to Articles

Fitting Redis Data Types into a B+tree

How BaseKV encodes strings, sets, hashes, lists, and streams onto a single ordered B+tree using composite keys and prefix range scans, with no per-collection headers.

BaseKV Team11 min read
basekvredisb-treedata-structureskey-valuestorage-engine

You are reading part 2 of our series on the internals of BaseKV, a disk-based key-value engine that speaks the Redis, Memcached, and DynamoDB wire protocols and stores everything in a single B+tree file. In part 1 we made the case for building a Redis-compatible store on disk instead of in RAM, and we ended on a question we deferred: Redis exposes strings, lists, sets, hashes, sorted sets, and streams, but the storage layer underneath BaseKV gives us exactly one thing, an ordered map from byte keys to byte values. This part is about closing that gap. It is the heart of the series, because every other capability we will discuss later (scanning, sorted sets, durability) is built on the encoding decisions made here.

The series, for orientation:

One ordered map is all you get

BaseKV stores its data in a B+tree (via BoltDB/bbolt, an embedded ordered key-value store in a single memory-mapped file). The primitive a B+tree gives you is small and rigid: a bucket is an ordered map from byte-string keys to byte-string values, with point get, point put, point delete, and a cursor that walks keys in sorted byte order. There is no notion of a "list value" or a "set value." There is no second dimension. If you ask the tree for a key, you get back at most one opaque blob of bytes. If you want a collection, you have to manufacture it out of many individual keys.

That constraint is the whole design problem, and it shapes everything. The good news is that the one thing the tree does give us, total byte ordering with cheap range scans, turns out to be enough to express every Redis collection, as long as we are deliberate about how we lay the keys out.

We start by carving the tree into a handful of named sub-buckets, one per data type, so that a key in the set world can never collide with a key in the hash world even if a user picks the same name for both:

stringsBucketName = []byte("strings")
listsBucketName   = []byte("lists")
setsBucketName    = []byte("sets")
hashesBucketName  = []byte("hashes")
zsetsBucketName   = []byte("zsets")
streamsBucketName = []byte("streams")

A small dispatcher, getBucketForDataType, maps a logical type name to the right sub-bucket. With that in place, the interesting question becomes: inside each sub-bucket, what does a key look like?

Strings: the case where the abstraction already fits

Strings are the easy type, and worth stating plainly because they are the baseline everything else is measured against. A Redis string is a single value addressed by a single key, which is exactly what the tree already models. So a SET foo bar becomes one entry in the strings bucket: the key foo maps to the value bar. GET foo is one point lookup. There is no encoding to invent, no scan, no composite anything. One logical key, one physical key, one value.

The reason every other type is harder is that every other type is a collection: one logical key (the set name, the hash name, the list name) has to fan out to many physical entries. That is where the encoding work lives.

Sets: membership as the existence of a key

A Redis set is an unordered bag of unique members. The natural way to express "member M belongs to set S" in an ordered map is to make a composite key out of S and M, store it with an empty value, and let the mere existence of that key encode membership. Adding a member is a point put. Checking membership is a point get. Removing a member is a point delete. Listing the set is a range scan over every key that begins with S.

The subtlety is in how you concatenate S and M. The obvious approach, gluing them with a separator byte like S:M, has a latent bug: if a set name can contain the separator, or a member can, then myset plus member x:y and a hypothetical set myset:x plus member y would both encode to myset:x:y, and a prefix scan for one collection would silently pick up members of the other. Arbitrary user bytes and an in-band delimiter do not mix.

BaseKV avoids the ambiguity by length-prefixing the key portion. The composite encoder is short enough to show in full:

func encodeCompositeKey(key, member []byte) []byte {
	buf := make([]byte, 4+len(key)+len(member))
	binary.BigEndian.PutUint32(buf[:4], uint32(len(key)))
	copy(buf[4:], key)
	copy(buf[4+len(key):], member)
	return buf
}

The physical key is [4-byte big-endian length of S][S bytes][M bytes]. There is no separator at all. The length prefix removes the ambiguity completely: a reader knows exactly how many bytes belong to the key before the member begins, so the boundary is never inferred from the content. Two different (set, member) pairs can never collide, because the only way to produce identical bytes is to have an identical length, an identical key, and an identical member.

The big-endian choice matters too. Big-endian integers sort the same way under byte comparison as they do numerically, so all members of a short-named set sort before all members of a longer-named set, and within a single length, all members of one set form a contiguous block. That contiguity is what makes the range scan exact. To iterate a set we build the prefix with the same length header and no member:

func encodeCompositePrefix(key []byte) []byte {
	buf := make([]byte, 4+len(key))
	binary.BigEndian.PutUint32(buf[:4], uint32(len(key)))
	copy(buf[4:], key)
	return buf
}

Then we seek the cursor to that prefix and walk forward while keys still carry it. SMEMBERS is precisely this loop, slicing the member back out by dropping the prefix bytes:

for k, _ := cursor.Seek(prefix); k != nil && bytes.HasPrefix(k, prefix); k, _ = cursor.Next() {
	member := k[len(prefix):]
	// copy member out; cursor bytes are only valid this iteration
}

SADD puts the composite key with an empty value, SREM deletes it, SISMEMBER is a single point get on the composite key, and SCARD is the scan again but counting instead of collecting. Notice what is absent: there is no "set object" stored anywhere, no header listing the members, no cardinality counter. The set exists only as the spread of its member keys across the tree. Delete the last member and the set is simply gone, with nothing to clean up.

Hashes: a field is a key, a value is a value

A hash is a map from field to value scoped under a key, which is one step richer than a set, because the per-member slot now carries a payload instead of being empty. BaseKV stores each field as its own entry in the hashes bucket, with a composite key built by joining the hash name and the field name with a space separator:

internalKey := []byte(key + " " + field) // space separates hash name from field
return tx.SetInBucket("hash", internalKey, value)

HSET key field value writes one entry, HGET key field is one point get, and HGETALL is a prefix scan over key + " " that splits each physical key back into name and field on the first space. This is the same pattern as sets, with the empty value replaced by the field's actual value and the length-prefix swapped for a space delimiter. The space delimiter is a pragmatic, weaker choice than the length prefix used for sets: it relies on the hash name not containing a space, which holds for the common case but is not the airtight guarantee that length-prefixing gives. It is worth being candid that the two collection types made different trade-offs here, and the length-prefixed scheme is the more robust one.

Either way, the principle is identical. The hash has no central object. Its fields are individual tree entries that happen to share a prefix, and HLEN is a scan-and-count over that prefix rather than a read of a stored counter.

Lists: indices instead of array rewrites

Lists are where naive encodings fall apart. A list needs ordered access and fast push and pop at both ends. The tempting approach, store the whole list as one serialized blob under one key, makes every push an O(n) rewrite: read the entire array, append, serialize, write it all back. For a list that grows to thousands of elements, every append rewrites thousands of elements.

BaseKV instead splits a list into a metadata entry plus one entry per element, and turns push and pop into integer arithmetic on the metadata. The meta entry holds three numbers, the head index, the tail index, and the length:

type listMeta struct {
	head   int64
	tail   int64
	length int64
}

stored under a <key>:meta entry, and each element lives at its own <key>:item:<index> entry. A fresh list does not start its indices at zero; it starts head and tail near the middle of the integer range (head at 1000000, tail just below it) so that the head can keep decreasing for left-pushes without ever going negative:

meta = &listMeta{head: 1000000, tail: 999999, length: 0}

An RPUSH increments the tail, writes the new element at <key>:item:<tail>, and bumps the length. An LPUSH decrements the head and writes at the new head. Each push is one point put plus a meta update, regardless of how long the list already is. Pop is the mirror image: read the element at the head (or tail), delete it, and move the index inward. When the length hits zero the meta entry is deleted, and the list ceases to exist:

meta.tail++
itemKey := fmt.Sprintf("%s:item:%d", key, meta.tail)
tx.SetInBucket("list", []byte(itemKey), value)
meta.length++

LINDEX and LRANGE translate a logical position into a physical index by adding the offset to the head, then doing point gets. The key property is that no list operation ever touches an element it is not asked to touch. Pushing the ten-thousandth item costs the same as pushing the first. The cost we pay is that the list is no longer one contiguous blob; it is a meta entry plus N scattered tree entries, and reading a wide range is N point lookups rather than one. For a queue or a log, which push and pop at the ends and rarely scan the whole thing, that is exactly the trade we want.

Streams: order falls out of the key

Streams need entries returned in ID order, and IDs are already monotically increasing timestamps. So a stream entry is stored under a key of the form stream:<name>:<entry-id>, with the serialized field/value pairs as the value. Because the tree keeps keys in sorted byte order, the entries of a stream are already laid out in ID order on disk, and XRANGE is a prefix scan that filters by the start and end IDs as it walks. XLEN is the same scan, counting. There is no separate index to maintain, because the index is the key ordering itself.

The principle, and its bill

Step back and the pattern across all of these is one idea: a collection needs no central header object. We never store "here is the list of members" or "here is the set of fields." Membership is the existence of a key, ordering is the sort order of keys, and iteration is a range scan over a shared prefix. The collection is an emergent property of how its entries are spread through the byte-ordered tree, not a thing stored in one place. This is why creating a collection is free (just write the first member) and why deleting the last member deletes the collection (there is nothing else to delete).

The bill for this elegance is real and worth stating plainly, the same honesty we brought to locks in our distributed locks article. An operation that touches every member of a collection, SMEMBERS, HGETALL, SCARD, XLEN, LLEN on the blob-free path, is O(members) of B+tree seeks. There is no stored counter to read in O(1); the count is recomputed by walking. For collections of a few hundred members on a memory-mapped tree this is fine, because the scan is sequential over keys that sort next to each other and the pages are likely already hot. For a set with millions of members, a SCARD that walks every one of them is a genuinely expensive operation, where in-memory Redis would answer from a maintained counter instantly. That is the price of having no per-collection object to hang a counter on, and it is a deliberate trade: we accept O(members) full-collection scans in exchange for an encoding that needs no headers, no separate indexes, and no bookkeeping that could drift out of sync with the data.

If you want the conceptual grounding for why a single ordered map is such a flexible foundation, our explainers on persistent key-value storage and what a key-value database is cover the model this whole design rests on.

That is how the types are laid out. The one type we deliberately skipped is the sorted set, because score-ordered ranking does not fall out of name-ordered keys, and making it work needs a second encoding trick. That is exactly where we go next.


Continue with part 3: Sorted sets on disk, or jump to part 4 on SCAN, part 5 on the three protocols, part 6 on durability and TTL, and part 7 on the RedisGraph subset. Or return to part 1.