UQA — Unified Query Algebra

A multi-paradigm database engine that unifies relational, text retrieval, vector search, graph query, and geospatial paradigms under a single algebraic structure, using posting lists as the universal abstraction.

GitHub PyPI Benchmarks Theory

Core Theory

UQA is built on four interconnected theoretical papers that establish a mathematical foundation spanning database theory, information retrieval, probabilistic inference, and neural computation.

Posting Lists as Universal Abstraction

Sorted sequences of $(id, payload)$ pairs form a complete Boolean algebra. Any combination of AND, OR, NOT across paradigms is algebraically well-defined.

Graph-Posting List Isomorphism

Graph traversals and pattern matches map to posting lists via a structure-preserving isomorphism, integrating seamlessly with text and vector operations.

Bayesian BM25

BM25 scores transform to calibrated probabilities via Bayesian inference, enabling principled multi-signal fusion without ad-hoc normalization.

Neural Emergence

Neural network structure emerges analytically from multi-signal Bayesian inference. Activation functions are derived as answers to probabilistic questions.

Read the full theoretical foundations →

Architecture

graph TD
    SQL["SQL Parser — pglast"] --> Compiler[SQL Compiler]
    QB["QueryBuilder — Fluent API"] --> Operators

    Compiler --> Optimizer[Query Optimizer]
    Optimizer --> Operators[Operator Tree]
    Operators --> Executor[Plan Executor]
    Executor --> PAR["Parallel Executor — ThreadPool"]
    Operators --> Cypher["Cypher Compiler — openCypher"]

    PAR --> DS["Document Store — SQLite"]
    PAR --> II["Inverted Index — SQLite + Analyzer"]
    PAR --> VI["Vector Index — IVF"]
    PAR --> GS["Graph Store — SQLite"]

    subgraph Scoring
        BM25[BM25]
        BBFS[Bayesian BM25]
        VS[Vector Scorer]
    end

    subgraph Fusion
        LO[Log-Odds]
        PB[Probabilistic Boolean]
    end

    Operators --> Scoring
    Operators --> Fusion
  

SQL Frontend

PostgreSQL 17 compatible SQL parsed by pglast. Supports DDL, DML, JOINs, CTEs, window functions, subqueries, set operations.

Operator Algebra

Composable operators over posting lists. Boolean, primitive, scoring, fusion, graph, and aggregation operators chain freely.

Volcano Execution

Iterator-based physical engine with Apache Arrow columnar batches. External merge sort, Grace hash aggregation, disk spilling.

Storage Layer

SQLite-backed persistence for documents, inverted indexes, vectors (IVF), and graph adjacency. Per-table isolation, ACID transactions.

Text Analysis

Pluggable pipeline: CharFilter, Tokenizer, TokenFilter, Analyzer. Built-in standard tokenizer, lowercase, stop words, Porter stemming.

Parallel Execution

Independent branches (Union, Intersect, Fusion) run concurrently via ThreadPoolExecutor. Configurable worker count.

Named Graphs

Isolated graph namespaces via create_graph() / drop_graph() with SQLite persistence. PageRank, HITS, betweenness centrality, and path indexes operate per named graph.

Cross-Paradigm Operators

Each paradigm maps into the posting list space through primitive operators. Because every operator produces a posting list, they compose freely.

OperatorDefinitionParadigm
$T(t)$ $PL(\lbrace d \in \mathcal{D} \mid t \in term(d, f) \rbrace)$ Text retrieval
$V_\theta(q)$ $PL(\lbrace d \in \mathcal{D} \mid sim(vec(d, f), q) \geq \theta \rbrace)$ Vector search
$KNN_k(q)$ $PL(D_k)$ where $|D_k| = k$, ranked by similarity Vector search
$Filter_{f,v}(L)$ $L \cap PL(\lbrace d \in \mathcal{D} \mid d.f = v \rbrace)$ Relational
$Score_q(L)$ $(L, [s_1, \ldots, s_{|L|}])$ Scoring
$\Phi(L_G)$ $PL\bigl(\bigcup_i \phi_{G \rightarrow D}(G_i)\bigr)$ Graph

A hybrid text + vector search is simply an intersection:

$$Hybrid_{t,q,\theta} = T(t) \cap V_\theta(q)$$

Fusion Functions

Multi-signal fusion combines scores from different paradigms. Log-odds conjunction resolves the conjunction shrinkage problem where naive probability multiplication underestimates relevance:

FunctionDescription
fuse_log_odds(s1, s2, ...)Log-odds conjunction — calibrated multi-signal fusion
fuse_prob_and(s1, s2, ...)$P = \prod P_i$
fuse_prob_or(s1, s2, ...)$P = 1 - \prod(1 - P_i)$
fuse_prob_not(signal)$P = 1 - P_{signal}$
fuse_log_odds(s1, s2, 'relu')Log-odds with ReLU/Swish gating
pagerank() / hits() / betweenness()Graph centrality as FROM or WHERE signal
progressive_fusion(s1, s2, k, ...)Cascading multi-stage WAND pruning
weighted_rpq('expr', start, 'prop')RPQ with cumulative edge weight tracking

Query Optimizer

DPccp Join Ordering (Moerkotte & Neumann, 2006)

The join order optimizer uses the DPccp algorithm — dynamic programming over connected subgraph complement pairs. It finds the optimal bushy join tree in $O(3^n)$ time, compared to $O(n!)$ for exhaustive enumeration. For queries with more than 16 relations, it falls back to a greedy $O(n^3)$ heuristic.

Bitmask DP Table

Relation subsets encoded as integer bitmasks for O(1) hash lookup and set operations, avoiding frozenset overhead in the inner loop.

Bytearray Connectivity

O(1) array-indexed connectivity check via bytearray(2^n) lookup table, replacing hash-based set membership tests.

Incremental Enumeration

Connected subgraphs built bottom-up via BFS extension with neighbor > min(S) invariant — each subgraph generated exactly once.

Canonical Submask Split

Enumeration of (S1, S2) splits uses the sub | lowest_bit trick to visit only the canonical half, halving the search space.

Cost-Based Optimization

SQL Interface

PostgreSQL 17 compatible SQL with cross-paradigm extensions. 95 of 103 target PostgreSQL 17 features implemented.

Multi-Signal Fusion

SELECT title, _score FROM papers
WHERE fuse_log_odds(
    text_match(title, 'attention'),
    knn_match(embedding, ARRAY[0.1, 0.2, ...], 5),
    traverse_match(1, 'cited_by', 2)
) AND year >= 2020
ORDER BY _score DESC;

Window Functions

SELECT rep, sale_date, amount,
       SUM(amount) OVER (ORDER BY sale_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM sales;

Recursive CTE

WITH RECURSIVE org_tree AS (
    SELECT id, name, 1 AS depth FROM org_chart WHERE manager_id IS NULL
    UNION ALL
    SELECT o.id, o.name, t.depth + 1
    FROM org_chart o INNER JOIN org_tree t ON o.manager_id = t.id
)
SELECT name, depth FROM org_tree ORDER BY depth;

Graph Query (openCypher / Apache AGE)

— Create a named graph (persisted in SQLite)
SELECT * FROM create_graph('social');

— Populate the graph with Cypher CREATE
SELECT * FROM cypher('social', $$
    CREATE (a:Person {name: 'Alice', age: 30})
    CREATE (b:Person {name: 'Bob', age: 28})
    CREATE (a)-[:KNOWS]->(b)
$$) AS (v agtype);

— Query the named graph
SELECT * FROM cypher('social', $$
    MATCH (p:Person)-[:KNOWS]->(friend:Person)
    WHERE p.age > 25
    RETURN p.name, friend.name, p.age
    ORDER BY p.name
$$) AS (name agtype, friend agtype, age agtype);

— Drop the graph and all its data
SELECT * FROM drop_graph('social');

Named graphs provide isolated graph namespaces with full SQLite persistence. Each graph stores vertices and edges in dedicated tables (_graph_vertices_*, _graph_edges_*), surviving engine restarts. Graph centrality functions (pagerank, hits, betweenness) and path indexes operate on named graphs via the 'graph:name' syntax.

Full-Text Search (@@ Operator)

— Query string mini-language: boolean, phrase, field targeting
SELECT title, _score FROM articles
WHERE title @@ 'database AND query' ORDER BY _score DESC;

— Hybrid text + vector fusion via @@
SELECT title, _score FROM articles
WHERE _all @@ 'body:search AND embedding:[0.1, 0.9, 0.0, 0.0]'
ORDER BY _score DESC;

Geospatial Query

CREATE TABLE restaurants (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    location POINT
);

CREATE INDEX idx_loc ON restaurants USING rtree (location);

SELECT name, ROUND(ST_Distance(location, POINT(-73.9857, 40.7484)), 0) AS dist_m
FROM restaurants
WHERE spatial_within(location, POINT(-73.9857, 40.7484), 5000)
ORDER BY dist_m;

Temporal Graph Traversal

— Edges valid at timestamp
SELECT * FROM temporal_traverse(1, 'knows', 2, 1700000000);

— GNN message passing: aggregate neighbor properties
SELECT * FROM papers
WHERE message_passing(2, 'mean', 'citations');

— Structural graph embeddings
SELECT * FROM papers
WHERE graph_embedding(16, 3);

Graph Centrality

SELECT title, _score FROM pagerank()
ORDER BY _score DESC;

SELECT title, _score FROM papers
WHERE fuse_log_odds(
    text_match(title, 'attention'),
    pagerank(),
    'relu'
) ORDER BY _score DESC;

Foreign Data Wrapper

CREATE SERVER warehouse FOREIGN DATA WRAPPER duckdb_fdw;

CREATE FOREIGN TABLE sales (
    id INTEGER, name TEXT, amount INTEGER,
    year INTEGER, month INTEGER
) SERVER warehouse OPTIONS (
    source '/data/sales/**/*.parquet',
    hive_partitioning 'true'
);

— DuckDB prunes Hive partitions at source
SELECT name, SUM(amount) FROM sales
WHERE year IN (2024, 2025) AND month > 6
GROUP BY name ORDER BY SUM(amount) DESC;

Foreign Data Wrappers

UQA federates external data sources through the FDW abstraction, bridging pyarrow.Table results into the posting list pipeline.

DuckDB FDW

In-process access to Parquet, CSV, JSON, and S3 objects. Hive partition discovery with predicate pushdown for partition pruning.

Arrow Flight SQL FDW

Remote access to any Arrow Flight SQL server — Dremio, DataFusion, DuckDB, and other compatible engines over gRPC.

Full Query Pushdown

Entire queries — JOINs, aggregates, window functions, subqueries — delegated to DuckDB or Arrow Flight SQL via AST deparsing. Zero Python row materialization.

Mixed Foreign-Local

Local dimension tables shipped to DuckDB as temp tables for in-process JOIN execution with large foreign fact tables.

Predicate Pushdown

Comparison (=, !=, <, >), IN, LIKE, ILIKE, and BETWEEN predicates are pushed down to the external engine, enabling Hive partition pruning and reducing data transfer.

Full Query Pushdown (v0.19.0)

When all tables in a query live on the same DuckDB server, the entire SQL — including JOINs, GROUP BY, ORDER BY, window functions, subqueries, and CTEs — is deparsed via pglast.stream.RawStream and executed natively by DuckDB. For mixed foreign-local queries, small local tables (<100K rows) are shipped to DuckDB as temp tables. Arrow Flight SQL supports same-server pure-foreign pushdown. If DuckDB cannot execute (e.g., UQA-specific functions like spatial_within), the standard UQA operator pipeline takes over automatically.

Benchmarks

295 pytest-benchmark tests across 14 subsystems track performance across commits. The CI pipeline detects regressions with a 150% threshold and blocks PRs that exceed it.

Posting List

Union, intersect, complement operations at 1K–100K entries. Two-pointer merge and boolean algebra.

Storage

SQLite document store and inverted index: insert, bulk load, lookup, and range scan throughput.

SQL Compiler

Parse + compile latency for SELECT, JOIN, CTE, window functions, and complex WHERE clauses.

Execution Engine

Volcano physical operators: sequential scan, filter, project, hash aggregate, sort, and disk spilling.

Join Planner

DPccp enumeration for star and chain join graphs with 3–16 relations. Greedy fallback at 20–30 relations.

Scoring

BM25, Bayesian BM25, vector scoring, and multi-signal fusion across varying corpus sizes.

Graph

BFS traversal, pattern matching (MRV + arc consistency), regular path queries on social-graph topologies.

Graph Centrality

PageRank, HITS, betweenness, bounded RPQ, weighted paths, subgraph index, incremental matching, and progressive fusion.

End-to-End SQL

Full query lifecycle: parse, optimize, execute, and materialize for analytical and hybrid queries.

View live benchmark dashboard →