Sorted Sets on Disk, and an Honest Trade-Off
How BaseKV stores ZADD scores as float64 bits, why IEEE-754 is not byte-sortable, and the honest in-RAM sort behind ZRANGE and ZRANK.
This is part 3 of a 7-part series on the internals of BaseKV, a disk-based, Redis-compatible key-value engine written in Go. You are reading part 3. If you want the framing first, part 1 lays out the disk-cost thesis and part 2 explains how we fold Redis data types into a single on-disk B+tree. This part is the candid one. Sorted sets are the place where the disk-first design meets a data structure that really wants to live in memory, and we are going to be honest about the seam.
What a sorted set has to do
A Redis sorted set keeps members ranked by a floating-point score. The operations that matter are ZADD to insert or update, ZSCORE to look up one member's score, and the family that depends on order: ZRANGE, ZREVRANGE, ZRANK, ZREVRANK, ZCOUNT. Redis backs this with a skip list plus a hash map, which gives O(log n) inserts and O(log n) rank queries. That combination is exactly what is hard to reproduce on disk, because a skip list is a pointer structure tuned for RAM. We do not have a skip list. We have a B+tree, the one bbolt gives us, and everything has to be expressed in terms of ordered byte keys inside that tree.
Storing a member
Every sorted set member becomes one key-value pair in the zset bucket. The key is a composite of the set name and the member name, built by the same encodeCompositeKey helper that part 2 introduced:
func encodeCompositeKey(key, member []byte) []byte {
buf := make([]byte, 4+len(key)+len(member))
binary.BigEndian.PutUint32(buf[:4], uint32(len(key)))
copy(buf[4:], key)
copy(buf[4+len(key):], member)
return buf
}
The layout is [4-byte length of set key][set key bytes][member bytes]. The length prefix is what keeps one set's members from bleeding into another's during a prefix scan: a set named a and a set named ab get distinct, unambiguous prefixes because the length is part of the key, not just the bytes. To list a set we build the prefix [4-byte length][set key] and seek to it with a cursor.
The value is the score, and the score is where this article starts. ZADD parses the score string to a float64 and writes the raw IEEE-754 bits as eight big-endian bytes:
scoreBytes := make([]byte, 8)
binary.BigEndian.PutUint64(scoreBytes, math.Float64bits(score))
math.Float64bits reinterprets the float's bit pattern as a uint64 without changing any bits, and PutUint64 writes it big-endian. ZSCORE does the exact inverse, math.Float64frombits(binary.BigEndian.Uint64(item.Value)), to recover the float. That round trip is lossless and cheap, and a point lookup of one member's score is a single B+tree get keyed by the composite key. So far this is a clean, fast encoding. The trouble is the word "order."
The problem: float bits are not byte-monotonic
It is tempting to think we have already solved sorted ranges. The scores are sitting in the tree as big-endian bytes. bbolt orders keys by raw byte comparison. Why not store the score as the key, range-scan the bucket, and read members out in score order for free? That is precisely how you would build it if the trick worked. It does not, and the reason is worth being exact about, because it is the crux of the whole trade-off.
IEEE-754 float64 bit patterns are not monotonic under unsigned byte comparison. Two specific things break.
First, negatives sort backwards. A float64's top bit is the sign bit, set to 1 for negative numbers. Under unsigned comparison any value with the high bit set looks larger than any value with it clear, so every negative score would sort after every positive score, which is upside down to begin with. Worse, within the negatives the magnitude bits run the wrong way: -1.0 and -2.0 differ in their exponent and mantissa such that -2.0, the smaller number, compares as the larger byte string. The negative range is not just misplaced, it is internally reversed.
Second, even ignoring signs, you cannot blindly trust the layout for mixed data. Positive floats happen to sort correctly among themselves by byte order, which is a nice property of the format, but the moment a single negative score enters the set the byte ordering and the numeric ordering diverge. A scoreboard with a penalty of -5 next to scores of 10 and 100 would come back in nonsense order from a raw range scan.
So the obvious optimization is off the table with the encoding as written. We cannot range-scan the values and call them sorted. The bytes lie about the order.
What BaseKV actually does: sort in RAM
We took the simple road, and we want to name it plainly rather than dress it up. For every order-dependent operation, BaseKV loads the entire set into memory, decodes each score, and sorts in RAM. ZRANGE is the clearest example. It seeks to the set's prefix and walks the cursor, collecting every member:
for k, v := cursor.Seek(prefix); k != nil && bytes.HasPrefix(k, prefix); k, v = cursor.Next() {
member := k[len(prefix):]
score := math.Float64frombits(binary.BigEndian.Uint64(v))
members = append(members, ZSetMember{Member: append([]byte(nil), member...), Score: score})
}
Then it sorts the slice by the decoded float and slices out the requested index window:
sort.Slice(members, func(i, j int) bool {
return members[i].Score < members[j].Score
})
ZREVRANGE is identical with the comparison flipped to >. ZRANK does the same gather-and-sort, then scans the sorted slice for the member and returns its index; ZREVRANK sorts descending and does the same. ZCOUNT walks the full prefix and counts the members whose decoded score falls in the range. ZSCORE is the only one that escapes, because looking up a single member by name is a point read on the composite key and never needs ordering. ZCARD also escapes, since counting is just walking the prefix without decoding anything.
This is honest O(n log n) work per call, where n is the cardinality of the one set being queried, plus the cost of reading those n pages off disk (or the OS page cache, on a warm set). It is not Redis's O(log n). For a ZRANK on a set of a thousand members we read and sort a thousand entries to answer a question about one. Redis would touch a handful of skip-list nodes.
We are comfortable with this for the workloads BaseKV is built for, and we want to draw the line where it actually is. For moderate sets, up to roughly the tens of thousands of members, the gather-and-sort is a sub-millisecond to low-millisecond operation and nobody notices. Leaderboards for a game lobby, recently-active user lists, priority queues with a bounded backlog, tag indexes with a few thousand entries: these are fine, and they are the common case. Where it degrades is high-cardinality sets queried hot in a loop. A set with hundreds of thousands of members, hit by ZRANGE on every request, will spend real time re-sorting the same data over and over, and the cost grows with cardinality on every single call because nothing is cached between calls. If that is your access pattern, BaseKV's sorted sets are the wrong tool today, and you should know that before you build on them.
What it would take to fix this
The fix is not exotic, and the reason we are describing it in detail is that it is the natural next step and any engineer reading the code will think of it. The goal is to make the B+tree itself yield score order, so range and rank become cursor walks instead of full sorts. That requires a secondary index keyed by score, and the score has to be encoded in a way that is monotonic under byte comparison. This is a known technique, sometimes called an order-preserving or byte-comparable float encoding.
The trick works on the bit pattern directly. Take the float64 bits. If the sign bit is 0, the number is non-negative, and you flip just the sign bit to 1. If the sign bit is 1, the number is negative, and you flip every bit. After that transform, plain unsigned big-endian comparison of the eight bytes matches numeric order across the entire float64 range, negatives included. The intuition is that flipping the sign bit lifts the positives above the negatives where they belong, and inverting all the bits of the negatives undoes their internally-reversed ordering so that -2.0 once again compares below -1.0. Decoding is the same transform run backwards before handing the value to Float64frombits.
With that encoding in hand, the design is a second bucket, call it zscore, holding keys shaped like [set key][order-preserving score][member] mapped to nothing, alongside the existing zset bucket that maps member to raw score for ZSCORE. ZADD would write both, and would first delete the member's old score-index entry to avoid leaving a stale rank behind. ZRANGE becomes a bounded cursor walk over the zscore prefix in tree order, reading only the window you asked for. ZRANK becomes a walk that counts until it reaches the member, or, with a count maintained in the tree, something closer to a seek. ZCOUNT becomes a seek to the encoded min and a walk to the encoded max. The per-call cost drops from O(n) reads to O(log n + range size), which is the whole point.
Why we have not built it yet
The honest answer has three parts, and none of them is "we forgot."
The first is that it costs simplicity, and simplicity is a feature we are deliberately protecting. A score index doubles the writes on every ZADD and ZINCRBY, adds a delete-then-insert dance to keep the index consistent with the member-to-score map, and introduces a second place where a bug can desynchronize the two views of the same set. Today the sorted-set code is short enough that you can read all of it in one sitting and convince yourself it is correct, and the zset bucket has exactly one fact in it per member. That auditability has real value while the engine is young.
The second is the disk-cost thesis the whole project rests on, which part 1 argues in full. BaseKV exists because a great many workloads do not need RAM-priced storage; they need cheap, durable, persistent storage with a Redis-shaped interface. The sets in those workloads are usually small. Optimizing the data structure for the high-cardinality hot-loop case would be optimizing for a workload that, by design, is not our center of gravity.
The third is that real workloads have not pushed on it. We would rather add the secondary index when a concrete usage pattern demands it, and design the encoding and the migration against that real shape, than ship speculative complexity now and maintain it forever. The order-preserving encoding is well understood and the path to it is clear, so the cost of waiting is low. When a sorted set's cardinality and query rate make the in-memory sort the bottleneck, that is the signal to build the index, and the design above is the one we will reach for.
The takeaway
BaseKV stores each sorted-set member as a composite key with the score written as raw big-endian float64 bits, which makes ZSCORE a clean point lookup. It does not get score-ordered ranges for free, because IEEE-754 bit patterns are not byte-monotonic, so the ordered operations gather the whole set into memory and sort.Slice it by float on every call. That is O(n log n) per query, fine into the tens of thousands of members and slower as cardinality grows. The fix is a secondary index keyed by an order-preserving score encoding, flip the sign bit for positives, invert all bits for negatives, so the B+tree yields sorted order directly. We have not built it because simplicity, the disk-cost thesis, and the absence of a workload that needs it all point the other way for now. We would rather tell you exactly where the edge is than pretend it is not there.
Continue the series: part 4 walks SCAN over a live B+tree (next), or go back to part 2 on encoding Redis types into a B+tree (previous). Part 1 is the series hub.