Deconstructing Datalog

In September 2022, after two rounds of revisions, I submitted the final version of my PhD dissertation, Deconstructing Datalog. Datalog is a logic programming language from the ’80s that augments relational algebra with recursive queries. It has both simple semantics and efficient implementation strategies. Like Lisp and the Velvet Underground, its influence exceeds its popularity; its ideas are still being absorbed into the mainstream.

While Datalog has a high power-to-weight ratio, it is fairly limited. For one thing, it has no functions or procedures: no way to abstract out repetitive code. As a fan of typed functional programming, I figured: how hard could it be? We have x (bottom-up logic programming); we have y (functional programming); what is x + y? Like a child playing with Legos, I set out to mash two cool things together to make a bigger, cooler thing: Datafun.

It worked!

However,

there were complications.

And if you want the whole story, you’ll just have to read it. It’s a reasonably short dissertation: 97 pages plus end-matter. But the primary results are that (i) this wacky idea works, and (ii) we can make it go reasonably fast, asymptotically speaking.

Taking Datalog apart and putting it back together again

My dissertation’s central idea is that we can seamlessly integrate Datalog’s features into a typed functional language by working backward from their semantics. Take Datalog’s most distinctive feature, recursive queries; for instance, reachability in a graph:

reachable(start).
reachable(Y) :- reachable(X), edge(X,Y).

This finds the smallest set reachable such that (1) start is reachable, and (2) if X is reachable and has an edge to Y, then Y is also reachable. We can rewrite these conditions as a single inequality on sets:

reachable ⊇ {start} ∪ {y : x ∈ reachable, (x,y) ∈ edge}

We can rewrite this to factor out the right-hand side:

reachable ⊇ f(reachable)
  where f(R) = {start} ∪ {y : x ∈ R, (x,y) ∈ edge}

This asks for a least prefix point: the least R that includes at least f(R) for some function f on relations. In fact, all recursive queries in Datalog fit this pattern. So to capture recursive queries in a functional language, we need (a) an expressive language for functions over relations, and (b) a least prefix point operator, fix. This turns our query into:

fix (λR. {start} ∪ {y | x ∈ R, (x,y) ∈ edge})

In a few short steps, we’ve turned predicates-and-logic into sets-and-functions.

Chapter 2 works out this recipe in detail. Most uniquely, to capture Datalog’s stratification condition, which ensures recursive queries are well-defined, Datafun needs to track monotonicity in its type system. This works compositionally: the composition of two maps is monotone if both maps are monotone; otherwise not. More technically, non-monotonicity can be represented in an otherwise monotone world by (in category theory) a monoidal comonad or (in type theory) a necessity modality or (in the compiler) carefully tracking which variables are safe to use non-monotonically. Like most type systems this is safe but incomplete — the type-checker rejects some monotone programs as non-monotone — but it handles everything Datalog can, and perhaps more.

How to find fixed points fast(er)

An important detail omitted above is fix’s implementation: how we find the least set R such that R ⊇ f(R). Take graph reachability, for instance:

f(R) = {start} ∪ {y | x ∈ R, (x,y) ∈ edge}

The naïve approach is to iterate f:

R₀ = ∅          = nothing
R₁ = f(∅)       = {start}
R₂ = f(f(∅))    = nodes within 1 edge of start
R₃ = f(f(f(∅))) = nodes within 2 edges of start
...
Rᵢ = fⁱ(∅)      = nodes within i-1 edges of start
...

until eventually Rᵢ = Rᵢ₊₁ because we have found the most distant reachable nodes.

This is an extremely inefficient breadth-first search, because it does redundant work. Consider the set-comprehension {y | x ∈ Rᵢ, (x,y) ∈ edges} needed to compute f(Rᵢ). This examines every node in Rᵢ; so in total we do at least as much work as the sum of the sizes of each Rᵢ. Now observe that the sequence Rᵢ grows monotonically, R₀ ⊆ R₁ ⊆ R₂ ⊆ ..., because a node within 1 edge of start is also within 2 edges, et cetera. So if we reach a node in an early iteration, we wastefully re-examine it in every following iteration. On some graphs this produces quadratic blowup (e.g. let edge = {(i, i+1) | i ∈ [1..n]}), each node being examined on average a linear number of times!

The solution is to examine each node only once. Naïve iteration asks “what can I deduce in at most n steps?” for iteratively increasing n. Seminaïve iteration asks: “what can I deduce in n steps that I couldn’t deduce before?” In other words, it asks for the changes between steps of naïve iteration — the frontiers of knowledge. We can compute these frontiers by incrementalizing our deduction function f, figuring out how it responds to change: given a change to its input, how does its output change — i.e. given the previous frontier, what is the next frontier? This in turn we can do by taking the discrete derivative of our program, in a sense made precise by prior work on the incremental λ-calculus. Chapter 3 shows how to apply this work to Datafun, incrementalizing it with respect to monotone (increasing) change.

Sine qua non

My dissertation lacks acknowledgements because I found them hard to write. Any real choice meant leaving someone out, and the non-choice of throwing everyone in felt disingenuous and exhausting. I took the easy way out and said nothing. For this post I’ve chosen a stark criterion: people without whom I would not have written this thesis. This is a pretty short list:

Neel Krishnaswami, my advisor — without whom I might not have written any thesis, let alone this one — for encouraging the fumblings that eventually became Datafun; for our two papers together (so far); and for everything else.
Eve/Kodowa (at the time, Chris Granger, Rob Attorri, and Jamie Brandon) for introducing me to Datalog and turning me down from my dream job, ensuring I kept plugging in academia. Thanks to Jamie for staying in touch.
My thesis examiners, Achim Jung and Jeremy Gibbons, for prompting me to rewrite Chapter 3’s introduction and thus rehabilitate the categorical approach; and for insisting I write a conclusion (Chapter 6, “Looking Back and Forward”).
nwf, for feedback on my final draft.

Many other people supported me through my PhD, but, by and large, they would have supported me in writing any dissertation at all. This does not make their support less valuable. I hope they know who they are.