How do databases count?
Given the simple query below, how does a database count?
psql -U postgres
select count(distinct col1) from table1;
Let’s ask the database directly; in this case it’s Postgres:
explain analyze select count(distinct col1) from table1;
This produces a series of algorithmic steps and structures specific to the database, output as a tree of “paths” that you read bottom-up. Here’s a trivial example that is inlined and doesn’t need a series of ‘optimisation passes’:
postgres=# explain analyze select 1 + 1;
QUERY PLAN
------------------------------------------------------------------------------------
Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.001..0.012 rows=1 loops=1)
Planning Time: 0.522 ms
Execution Time: 0.128 ms
(3 rows)
This is not the only representation of a query plan. SQLite, on the other hand, does a curious thing: instead of holding a tree as an internal representation, it compiles down to bytecode. Why it makes this decision is a plenty interesting design space1:
sqlite> explain select 1 + 1;
addr opcode p1 p2 p3 p4 p5 comment
---- ------------- ---- ---- ---- ------------- -- -------------
0 Init 0 4 0 0 Start at 4
1 Add 2 2 1 0 r[1]=r[2]+r[2]
2 ResultRow 1 1 0 0 output=r[1]
3 Halt 0 0 0 0
4 Integer 1 2 0 0 r[2]=1
5 Goto 0 1 0 0
If you’d rather see the plan than read the opcodes:
sqlite> explain query plan select 1 + 1;
QUERY PLAN
`--SCAN CONSTANT ROW
A query plan is the output of a program, and like all programs it has a rich history, architectural decisions, algorithms, data structures, trade-offs and constraints. It takes as input a query, typically in a query language (here it’s SQL), and lets you retrieve ‘facts’ by isolating the how from the underlying storage. This decoupling gives many benefits and in hindsight is obvious, but it wasn’t always so, until someone(s) figured it out2 3:
postgres=# select 1 + 1;
?column?
----------
2
(1 row)
To answer our question, let’s build a tiny, not at all functional, but illustrative query engine in less than 500 lines of Rust, for the query:
select count(distinct col) from table;
The goals of a query engine specify the need to be correct and fast as data grows. Correctness is an interesting word, and it has a context rooted in a two-part formalism: relational algebra and relational calculus.
Of interest is how the first formalism describes a number of operations on an unordered collection of sets:
- selection
- projection
- union
- intersection
- difference
- product (cross product)
- join
- division (*)
and a few useful modern extensions, like sorting, windows, aggregates, etc.
To answer this query, it seems we need to plan several things: two logical operators, or logical nodes, which define this transformation:
- select - to specify what we want
- projection - to specify a few details about what is of interest
and a function, in this case an aggregate function called COUNT(expr), and finally some way to represent relations. In this naive engine we don’t have a real ‘schema’, quite yet or ever, but you could imagine a relation as:
/*
schema: relation + col -> row mapping
storage(tuples): [{k,v}, {k,v}, {k,v}, {k,v}]
*/
#[derive(Debug, Clone)]
pub struct Relation {
    pub col_names: Vec<String>,
    pub rows: Vec<Vec<String>>,
}
A selection here does a full scan and filters rows based on the predicate:
// selection table/relation + predicate (expr = true | false | unknown)
// σ predicate(R). SQL: SELECT * FROM R WHERE a_id = 'a2'
pub fn select(&self, idx: usize, expr: &str) -> Relation {
    let result: Vec<Vec<String>> = self.rows
        .iter()
        .filter(|row| row[idx] == expr)
        .cloned()
        .collect();
    Relation {
        col_names: self.col_names.clone(),
        rows: result,
    }
}
A projection here is a modifier operation over the set:
// Projection: modification (r/w/order) over cols, changes the shape of output/attributes
// π a1, a2, ..., an (R).
// SQL: SELECT b_id-100, a_id FROM R WHERE a_id = 'a2'
pub fn projection(&self, columns: &[usize]) -> Relation {
    let result: Vec<Vec<String>> = self.rows
        .iter()
        .map(|row| {
            columns.iter()
                .map(|&col_idx| row[col_idx].clone())
                .collect()
        })
        .collect();
    let col_names: Vec<String> = columns
        .iter()
        .map(|&col_idx| self.col_names[col_idx].clone())
        .collect();
    Relation {
        col_names,
        rows: result,
    }
}
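To see how the two operators compose before we worry about SQL, here’s a quick usage sketch; the sample rows are made up, and it assumes select and projection live in an impl Relation block:
// Usage sketch (hypothetical data): filter with σ, then narrow with π.
fn main() {
    let r = Relation {
        col_names: vec!["a_id".to_string(), "b_id".to_string()],
        rows: vec![
            vec!["a1".to_string(), "101".to_string()],
            vec!["a2".to_string(), "102".to_string()],
            vec!["a2".to_string(), "103".to_string()],
        ],
    };
    // σ a_id = 'a2' (R), then π b_id
    let filtered = r.select(0, "a2");
    let projected = filtered.projection(&[1]);
    println!("{:?}", projected.rows); // [["102"], ["103"]]
}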
Now we have a logical plan of operations and transformations for this query, but it needs a syntax to describe these operations. Re-enter SQL, or was it SEQUEL? Of note is the observation that the logical operations are independent of the syntax used to describe them.
We first need to parse the SQL and build a simplified abstract syntax tree where the nodes are the logical operators, selection and projection, while preserving the semantics of applying the count. Luckily this query engine doesn’t need to support the SQL standard or dialects, so we can cut corners :) and just parse out exactly what’s needed, without walking a tree or using a pretty cool generalization over a grammar:
// parser.rs parses SELECT COUNT(DISTINCT col) FROM table;
// and produces a data structure we'd now pass to the
// 'physical/execution' planning stage (picking indices etc).
// In this example there's only one possible strategy, `SeqScan`,
// so in a strict sense is this a combined logical + physical plan?
SelectStatement {
    projection: AggregateExpression {
        function: Aggregation::Count,
        column: Column {
            name: "col".to_string(),
            distinct: true,
        },
    },
    table: "table".to_string(),
};
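With a parsed statement in hand, the execution step for this one query can be sketched as a sequential scan plus a projection and a distinct count. This is a rough sketch against the Relation type above; execute_count_distinct is a made-up helper for illustration, not part of any real planner:
use std::collections::HashSet;

// Rough 'physical' step for this one query: SeqScan the relation,
// project the counted column, then count the distinct values.
fn execute_count_distinct(rel: &Relation, stmt: &SelectStatement) -> usize {
    // resolve the parsed column name to an index in the relation
    let idx = rel
        .col_names
        .iter()
        .position(|c| *c == stmt.projection.column.name)
        .expect("unknown column");

    // π col (R), then COUNT(DISTINCT col); we assume `distinct` was set
    let projected = rel.projection(&[idx]);
    let unique: HashSet<&String> = projected.rows.iter().map(|row| &row[0]).collect();
    unique.len()
}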
Statistics & Costs
Lastly, all that’s left is to count, which brings us to feature two: performance. A historical glance reveals some influential architectural decisions. We’ve established the need to separate the logical what of a query from the physical/execution how; in this simplified all-in-one planner we gloss over that very important detail. Further yet, we realised SQL (and its dialects) are really syntactic abstractions.
Why is the performance of counting interesting?
The situation gets much more complex when operations like projections, selections, multiple joins in combination with various boolean operations appear in queries. As an example, the relational system system R has a sophisticated query optimiser. In order to perform its task, that programme keeps several statistics on relations of the database. The most important ones are the sizes of relations as well as the number of different elements of some key fields [8]. This information is used to determine the selectivity of attributes at any given time in order to decide the choice of keys and the choice of the appropriate algorithms to be employed when computing relational operators. The choices are made in order to minimise a certain cost function that depends on specific CPU and disk access costs as well as sizes and cardinalities of relations or fields. In system R, this information is periodically recomputed and kept in catalogues that are companions to the database records and indexes1
In Postgres this subsystem is called the Cumulative Statistics System. Hopefully this contextualizes why keeping track of counts and making them fast is important: it’s not just to serve the SQL aggregate function COUNT, it’s also quite useful internally for the planner.
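As a toy illustration of how such statistics feed planning, here’s a hedged sketch of the classic selectivity estimate for an equality predicate, where the expected row count is total rows divided by distinct values; the struct and names are made up, not Postgres internals:
// Hypothetical per-column statistics a planner might keep.
struct ColumnStats {
    row_count: f64,  // total rows in the relation
    n_distinct: f64, // estimated number of distinct values
}

// Toy estimate for `col = constant`: assuming values are uniformly
// distributed, each distinct value matches row_count / n_distinct rows.
fn estimated_rows_for_equality(stats: &ColumnStats) -> f64 {
    if stats.n_distinct <= 0.0 {
        return 0.0;
    }
    stats.row_count / stats.n_distinct
}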
Naive Counting
There are two flavors of counting we’re interested in:
- size (counting all elements)
- cardinality (roughly, counting unique elements)
Counting elements for an exact size could be as simple as a counter; an ADD instruction is very fast. But if we’re storing a lot of different counts, wouldn’t it be nice if we could save on memory too? What if you don’t care about an exact count? Say we only desire a rough count over some interval to make some informed decisions?
On the other side of the coin, how do we count unique elements?
// This is computationally inefficient in time
func countUniqStack(arr []int) int {
	mystack := MyStack{}
	for _, element := range arr {
		// expensive check
		if !mystack.contains(element) {
			mystack.stack = append(mystack.stack, element)
		}
	}
	return len(mystack.stack)
}
Perhaps a hashmap, where lookups are O(1)?
// This is computationally inefficient in space
// For a large set s, we must `make` unused space 'in case'
func countUniqMap(arr []int) int {
	seen := make(map[int]bool)
	uniqs := []int{}
	for _, element := range arr {
		if !seen[element] {
			uniqs = append(uniqs, element)
			seen[element] = true
		}
	}
	return len(uniqs)
}
It seems like a stack and a hashmap won’t work; how does one store less and compute only what’s necessary?
Probabilistic counting
Two interesting and clever data structures relax the requirement of counting exact elements in a stream by using probabilistic schemes that offer a sketch: the Morris Counter4 and the HyperLogLog. The Morris counter saves on the space required to hold the representation of a stream’s count; it gets a little “mathy”, and if you’re interested this excellent blog post has a great explanation to intuit the math.
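A minimal sketch of the Morris idea, not the paper’s full analysis: keep only a small exponent c, increment it with probability 2^-c, and report 2^c - 1 as the estimate. The tiny xorshift generator below is only there to keep the example dependency-free:
// Minimal Morris-style approximate counter (illustrative sketch).
// It stores only a small exponent `c`; the estimate of n is 2^c - 1.
struct MorrisCounter {
    c: u32,
    rng_state: u64, // tiny xorshift state, just to avoid external crates
}

impl MorrisCounter {
    fn new(seed: u64) -> Self {
        MorrisCounter { c: 0, rng_state: seed.max(1) }
    }

    // xorshift64: fine for an illustration, not for production randomness
    fn next_random(&mut self) -> u64 {
        let mut x = self.rng_state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.rng_state = x;
        x
    }

    // increment with probability 2^-c: true when the low c random bits are all zero
    fn increment(&mut self) {
        let mask = (1u64 << self.c) - 1;
        if self.next_random() & mask == 0 {
            self.c += 1;
        }
    }

    fn estimate(&self) -> u64 {
        (1u64 << self.c) - 1
    }
}
Over many increments the expectation of 2^c - 1 tracks the true count, while c itself only needs about log log n bits.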
The HyperLogLog, on the other hand, allows the estimation of cardinalities to the tune of over a billion using only ~1.5 kilobytes, with a margin of error of roughly 2% (about 98% accuracy). Those are incredible numbers,5 how does it work?
The input to this algorithm is a continuous stream (elements are read sequentially), say:
["hello", "bye", "hello", "world", "universe", "foo"]
We want to perform a single pass over these elements (a multiset in the paper) and output an estimate of the cardinality of unique items when we’re done, by utilising a hash function to produce uniformly random bits for each element:
hash_fn : Domain → {0, 1}∞
Which might produce a binary stream (S) like:
[101010, 100000, 00100, 0000101, 0100101]
The paper draws attention to some observations about the patterns in the bits produced, which allow us to infer a plausible estimate of the unknown cardinality n. These observations are:
- Bit-pattern observables
- Order statistics observables
In particular we’re focused on the first, bit-pattern observables: in a uniformly random bit string, a prefix of k-1 zeros followed by a one appears with probability 2^-k, so the longest run of leading zeros observed across the stream hints at roughly log2 of the cardinality.
Once we’ve identified this pattern in the hashed bits, we can make several “estimation passes” by making each “guess” in parallel and later combining them using a pretty neat formula. It’s a short algorithm, but it requires some clever bit shifting and a uniform hash that behaves properly.
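To make the bit-pattern observable concrete, here’s a small hedged sketch of the scheme: route each hash to one of m registers using its top bits, keep the maximum “leading zeros + 1” per register, and combine the registers with a normalized harmonic mean. It skips the paper’s small- and large-range corrections, and the alpha constant is the paper’s value for 64 registers:
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Small HyperLogLog-style sketch: 2^B registers, each holding the maximum
// "leading zeros + 1" seen among hashes routed to that register.
const B: u32 = 6; // 2^6 = 64 registers
const M: usize = 1 << B;

struct Hll {
    registers: [u8; M],
}

impl Hll {
    fn new() -> Self {
        Hll { registers: [0; M] }
    }

    fn add<T: Hash>(&mut self, item: &T) {
        let mut hasher = DefaultHasher::new();
        item.hash(&mut hasher);
        let x = hasher.finish();

        // top B bits pick the register, the rest feed the bit-pattern observable
        let idx = (x >> (64 - B)) as usize;
        let rest = x << B;
        let rank = (rest.leading_zeros() + 1).min(64 - B) as u8;
        self.registers[idx] = self.registers[idx].max(rank);
    }

    fn estimate(&self) -> f64 {
        // normalized harmonic mean of 2^register; alpha ≈ 0.709 for m = 64
        let alpha = 0.709;
        let m = M as f64;
        let sum: f64 = self.registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
        alpha * m * m / sum
    }
}
At tiny cardinalities like the toy stream above, this raw estimator is badly biased, which is exactly why the paper adds a linear-counting correction for small ranges.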
HyperLogLog is now a fairly standard data structure in analytics and realtime/main-memory databases. A few examples of adoption: in the Postgres ecosystem, Citus, Crunchy Data and TimescaleDB; broadly at Meta (Presto); at Google in BigQuery; in Redis; and many more.