DIY an on-disk B+ Tree

💡 This article is a speed-run/crash-course/concise summary and assumes the reader has read, or is somewhat familiar with, the concepts explained in Part I of Database Internals: notably chapters 1-4, and the parts of chapter 5 where buffer management is explained.

If not, here’s a visual crash course on the operations of a B-Tree and on-disk considerations in general, or a much better walkthrough with SQLite.

What is a data structure’s memory representation? Everything is either a contiguous block or pointer-based. Relevant programming techniques: pointers, recursion and binary search (O(log N) is powerful), syscalls and binary formats.
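A hedged illustration of that split (the names here are mine, purely illustrative): a slice is one contiguous block, while a linked node is reachable only by chasing pointers.

// contiguous block: index i is just base + i*elemSize, cache friendly
block := []int{1, 2, 3, 4}
_ = block[2] // O(1) pointer arithmetic

// pointer-based: each node may live anywhere on the heap;
// access means chasing a pointer per hop, cache hostile
type listNode struct {
	value int
	next  *listNode
}
head := &listNode{1, &listNode{2, nil}}
_ = head.next.value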

B-Trees are useful for:

  1. In-memory indexes/‘index b-trees’ (also used here!)
  2. Persisted on-disk storage organisation/’table b-trees’. <- we’re here.

Why is it called a B-Tree? According to one of its co-inventors, Edward M. McCreight, it’s short for “balance” – but could mean anything :)

Implementation: high-level ideas

Considerations:

Performance big ideas, an overview:

B Trees (In-Memory)

A simple and useful way of thinking about access and logarithmic bisection is as a tree over a 2-D array:

[[1,2,3], [4,6], [9,11,12]]
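A hedged sketch of that lookup (find is my naming): pick the row whose range could contain the key, then bisect within the row. A B+ Tree does exactly this, except the outer step is also logarithmic.

// outer step: pick the candidate row (linear here for brevity;
// a B-Tree bisects this step too), then bisect inside it
func find(tree [][]int, key int) bool {
	row := 0
	for i, r := range tree {
		if len(r) > 0 && r[0] <= key {
			row = i
		}
	}

	// inner step: O(log n) bisection within the row
	lo, hi := 0, len(tree[row])-1
	for lo <= hi {
		mid := lo + (hi-lo)/2
		switch {
		case tree[row][mid] == key:
			return true
		case tree[row][mid] < key:
			lo = mid + 1
		default:
			hi = mid - 1
		}
	}
	return false
}

find([][]int{{1, 2, 3}, {4, 6}, {9, 11, 12}}, 11) // true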

visualisation: https://www.cs.usfca.edu/~galles/visualization/BPlusTree.html

a simple btree example

desired properties:

/*
Simple Persistent B Plus Tree.
Node keys and values are assumed to be signed integers.
Persistence is achieved using a naive bufio.Writer + flush.
Concurrency control using a simple globally blocking RWMutex lock.

B-Tree implementations have many implementation-specific details
and optimisations before they're 'production' ready; notably they
may use a free-list to hold 'free' pages and employ
sophisticated concurrency control.
see also: CoW semantics, buffering, garbage collection etc.

// learn more:
// etcd: https://pkg.go.dev/github.com/google/btree
// sqlite: https://sqlite.org/src/file/src/btree.c
// wiki: https://en.wikipedia.org/wiki/B%2B_tree
*/

Definition:

A B+ Tree with an arbitrary max degree of 3, where degree is the number of pointers/children each node can point to/hold:

const (
	MAX_DEGREE = 3
)

type NodeType int

const (
	ROOT_NODE NodeType = iota + 1
	INTERNAL_NODE
	LEAF_NODE
)

type BTree struct {
	root *Node
}

A node:

type Node struct {
	kind     NodeType
	// maintaining a parent pointer is expensive
	// in internal nodes especially
	parent   *Node
	keys     []int
	children []*Node
	data     []int

	// sibling pointers; these help with deletions + range queries
	next     *Node
	previous *Node
}

Operations Overview:

func (n *Node) basicSearch(key int) *Node {
	if len(n.children) == 0 {
		// you are at a leaf node and can now access stuff,
		// or this is the leaf node that should contain stuff
		return n
	}

	low, high := 0, len(n.keys)-1
	for low <= high {
		mid := low + (high-low)/2
		if n.keys[mid] == key {
			// values live at the leaves in a B+ Tree, so a
			// matching separator key means: descend right
			return n.children[mid+1].basicSearch(key)
		} else if n.keys[mid] < key {
			low = mid + 1
		} else {
			high = mid - 1
		}
	}

	return n.children[low].basicSearch(key)
}
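
// example (hand-built tree, purely illustrative): searching for 11
// descends from the root into the right-most leaf:
//
//	root := &Node{
//		kind:     ROOT_NODE,
//		keys:     []int{4, 9},
//		children: []*Node{leaf1, leaf2, leaf3}, // [1 2 3], [4 6], [9 11 12]
//	}
//	root.basicSearch(11) // returns the [9 11 12] leaf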
func (n *Node) split(midIdx int) error {
	// first find a leaf node.
	// every node except the root node must respect the inequality:
	// branching factor - 1 <= num keys < (2 * branching factor) - 1

	// if this doesn't make sense, ignore it. The takeaway:
	// every node except root has a min/max num keys or it's invalid.

	// edge case: how to handle the root node?

	// node is full: promotion time, split keys into two halves
	splitPoint := n.keys[midIdx]
	leftKeys := n.keys[:midIdx]
	rightKeys := n.keys[midIdx:]

	n.keys = []int{splitPoint}

	leftNode := &Node{kind: LEAF_NODE, keys: leftKeys}
	rightNode := &Node{kind: LEAF_NODE, keys: rightKeys}
	n.children = []*Node{leftNode, rightNode}

	// -- LEAF
	//       (n: [splitPoint])
	//        /             \
	//   (left leaf)   (right leaf)

	// recurse UP from curr to a node which may overflow;
	// check whether we're full and if full, we split
	// again and allocate new node(s)
	// --snipped for clarity
	return nil
}
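
// a hedged sketch of those invariants (overflows/underflows naming
// is illustrative); a non-root node splits on overflow and
// merges/steals on underflow:
func (n *Node) overflows() bool {
	return len(n.keys) >= MAX_DEGREE
}

func (n *Node) underflows() bool {
	minKeys := int(math.Ceil(float64(MAX_DEGREE)/2)) - 1
	return n.kind != ROOT_NODE && len(n.keys) < minKeys
}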
func (n *Node) merge(sibling *Node, key int) error {
	// delete data from the leaf node
	// try stealing from the sibling first; if still underfull --snipped
	// do stuff to prepare for merging, assume we _can_ merge

	// deallocate/collapse the underflowing node
	sibling.data = append(sibling.data, n.data...)

	for i, node := range sibling.parent.children {
		if node == n {
			n.parent.children = append(n.parent.children[:i], n.parent.children[i+1:]...)
		}
	}

	// recurse UPWARD and check for underflow
	for i, k := range sibling.parent.keys {
		if k == key {
			sibling.parent.keys = cut(i, sibling.parent.keys)

			if len(sibling.parent.keys) < int(math.Ceil(float64(MAX_DEGREE)/2)) {
				// preMerge picks a merge candidate --snipped
				if next, _, err := sibling.parent.preMerge(); err == nil {
					return sibling.parent.merge(next, key)
				}
			}
			return nil
		}
	}

	return nil
}
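
merge leans on the snipped cut helper; a minimal sketch:

// cut removes the key at index i, shifting the tail left.
// an on-disk implementation would mark the slot free instead
// of memmove-ing records around.
func cut(i int, keys []int) []int {
	return append(keys[:i], keys[i+1:]...)
}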

B Trees (Going to Disk)

On disk we cannot reference memory using pointers, and we can no longer allocate/deallocate freely. We read/write/seek through the operating system in fixed-size blocks, commonly 4KiB - 16KiB.
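A hedged sketch of that interface (pageSize and the helpers are illustrative): because every block is the same size, a page's offset is pure arithmetic.

const pageSize = 4096 // 4KiB, a common default

// read page `id` in one syscall; offset = id * pageSize
func readPage(f *os.File, id int64) ([]byte, error) {
	buf := make([]byte, pageSize)
	_, err := f.ReadAt(buf, id*pageSize)
	return buf, err
}

func writePage(f *os.File, id int64, buf []byte) error {
	_, err := f.WriteAt(buf, id*pageSize)
	return err
}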

The classic B+ Tree paper uses a triplet, {pointer/offset to child, key, value}, limited to fixed-size storage (which makes fragmentation difficult to manage).

the “classic” datafile layout:

+++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ header | {p1, key1, value1}, {p2, key2, value2}  .. +
+++++++++++++++++++++++++++++++++++++++++++++++++++++++

Slotted pages are common in most row/tuple-oriented database implementations such as SQLite and Postgres; they solve the problems of space reclamation and variable-size data records. In columnar formats such as Parquet, a modern variable-length encoding is used instead, and a dictionary of data pages is maintained, compressed and tightly packed.

E.g. the page layout in Postgres: postgres page layout

The logical view of the datafile:

[header (fixed size)] [page(s) 4KiB | ...] [trailer(fixed size)]

show me the code - the high-level “page”:

// Page is a (de)serialised disk block similar to: https://doxygen.postgresql.org/bufpage_8h_source.html
// It is a contiguous 4KiB chunk of memory, both a logical and physical representation of data.
type Page struct {
	pageHeader // embedded, so p.PageID, p.PLower etc. below resolve

	cellPointers []int16
	cells        []cell
}

its fixed-size page header:

type pageHeader struct {
	// A marker value to indicate that the file is a Bolt DB.
	// copy/pasta, as a magic number is 'magic' and kind of made up.
	// (exported so encoding/binary's reflection can reach it)
	Magic   uint32 // 4 bytes
	PageID  uint32 // 4 bytes
	Reserve uint32 // 4 bytes

	FreeSlots uint16 // 2 bytes
	// the physical offsets mapping to the beginning
	// and end of the allocated block on the datafile for this page
	PLower   uint16 // 2 bytes
	PHigh    uint16 // 2 bytes
	NumSlots byte   // 1 byte (uint8)

	// all cells are of type CellLayout, i.e. key/pointer or key/value cells
	CellLayout byte // 1 byte (uint8)
	PageType   byte // 1 byte (uint8)
}

and finally the cell:

type cell struct {
	cellId    int16
	keySize   uint64
	valueSize uint64
	keys      []byte
	data      []byte
}
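
Why the indirection through cellPointers? A hedged sketch (cellAt is my naming): deleting or re-sorting records only shuffles the small pointer array, while the cells themselves stay put.

// cellAt resolves a logical slot to its physical cell.
// compaction and reordering rewrite cellPointers only; the
// (potentially large) cells never move within the page.
func (p *Page) cellAt(slot int) cell {
	return p.cells[p.cellPointers[slot]]
}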

naively flushing a page + fsync:

func (p *Page) Flush(datafile *os.File) error {
	buf := new(bytes.Buffer)

	if _, err := datafile.Seek(int64(p.PLower), io.SeekStart); err != nil {
		return err
	}

	binary.Write(buf, binary.LittleEndian, &p.pageHeader)
	binary.Write(buf, binary.LittleEndian, p.cellPointers)
	// NB: cells hold variable-length keys/data, which encoding/binary
	// cannot reflect over; each cell is encoded manually --snipped

	n, err := buf.WriteTo(datafile)
	if err != nil {
		return err
	}
	if err := datafile.Sync(); err != nil {
		return err
	}

	log.Printf("written %v bytes to disk at pageID %v", n, p.PageID)
	return nil
}

and fetching it back:

// --snipped; tl;dr: conceptually a hashmap
// mapping a page ID to an actual offset in the datafile
var pageDictionary map[int]int64

func Fetch(pageID int, datafile *os.File) (Page, error) {
	var page Page

	if _, err := datafile.Seek(pageDictionary[pageID], io.SeekStart); err != nil {
		return page, err
	}
	if err := binary.Read(datafile, binary.LittleEndian, &page.pageHeader); err != nil {
		return page, err
	}

	return page, nil
}

There are some cool optimisations we can do!
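One such optimisation, sketched under assumptions (FetchCached is my naming, and the policy here is 'cache everything'; a real buffer pool evicts, pins and tracks dirty pages, see chapter 5 of Database Internals): put a cache in front of Fetch so hot pages skip the syscall.

// a naive page cache / buffer pool: check memory before disk
var pageCache = map[int]Page{}

func FetchCached(pageID int, datafile *os.File) (Page, error) {
	if p, ok := pageCache[pageID]; ok {
		return p, nil // cache hit: no seek, no read
	}
	p, err := Fetch(pageID, datafile)
	if err != nil {
		return Page{}, err
	}
	pageCache[pageID] = p
	return p, nil
}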

Miscellaneous

const magic uint32 = 0xED0CDAED
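
A hedged example of what the magic number buys us (validate is my naming): refuse to read a datafile whose header doesn't start with it.

// validate on open: if the header's marker doesn't match,
// this is not one of our datafiles (or it's corrupt)
func validate(h pageHeader) error {
	if h.Magic != magic {
		return fmt.Errorf("bad magic 0x%X: not a valid datafile", h.Magic)
	}
	return nil
}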

See the demo repository for more examples!

Future, Maybe Never :)

- The pitfalls of memory: allocation, fragmentation & corruption
- concurrency mechanisms - snapshots, OCC/MVCC
- async split/merge/delete
- BTStack/Breadcrumbs
- lazy traversal: Cursor/Iter
- more optimisations/variants: B-link, CoW B-Trees, FD-Trees etc
- robust testing and correctness guarantees
- (de)serialisation to the network (replication)
- DIY a freelist!
- io_uring/async/direct I/O
- durability! DIY a WAL or smaller "rollback journal"
- good benchmarking & profiling etc., see tools of the trade

Tools of the Trade

Base your judgement on empirical fact:

Performance comes from thinking holistically about hardware and software together. Optimisations bear the fruit of pretty benchmarks at the price of added complexity. The first big piece is concurrency and parallelism.

tl;dr: the hard part about building storage engines is debugging and testing them; old databases are good because of the time they've spent in production uncovering (or ruling out) bugs.

In the real world stuff goes wrong (the operating system hides hardware and software faults, but you still have to care), and data loss is a big no-no.

Running in production: Correctness, Testing & Safety

Complexity is evil but unavoidable; non-determinism makes you helpless.

testing methodology, loom (concurrency is hard), etc.:
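One cheap technique, sketched under assumptions (the tree API with NewBTree/Insert/Get is assumed, not defined above): model-based testing, i.e. run a random workload against both the tree and a trivially-correct model (a plain map) and diff the results.

// any divergence between the tree and the model is a bug in the tree
func TestTreeAgainstModel(t *testing.T) {
	tree, model := NewBTree(), map[int]int{}
	r := rand.New(rand.NewSource(42)) // fixed seed: reproducible failures

	for i := 0; i < 10000; i++ {
		k, v := r.Intn(100), r.Int()
		tree.Insert(k, v)
		model[k] = v
	}
	for k, want := range model {
		if got, ok := tree.Get(k); !ok || got != want {
			t.Fatalf("key %d: got %v, want %v", k, got, want)
		}
	}
}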

Further reading: storage engine architecture

There's a move towards modularisation of database components: decoupled query & execution engines (Velox, DataFusion etc.) and storage engines, see: FoundationDB.

What the heck is going on in your favorite database? A select/biased set of deep dives into popular storage engines: postgres/Postgres, kubernetes/etcd, mysql/InnoDB, mongodb/WiredTiger:

postgres: https://postgrespro.com/blog/pgsql/4161516

etcd: https://etcd.io/docs/v3.5/learning/data_model/

innodb: https://dev.mysql.com/doc/refman/8.0/en/innodb-physical-structure.html

mongodb: https://source.wiredtiger.com/11.2.0/arch-btree.html

Things we’ve come to want/expect out of modern databases: