UQA — Unified Query Algebra

Core Theory

UQA is built on five interconnected theoretical papers that establish a mathematical foundation spanning database theory, information retrieval, probabilistic inference, neural computation, and vector calibration.

Posting Lists as Universal Abstraction

Sorted sequences of $(id, payload)$ pairs form a complete Boolean algebra. Any combination of AND, OR, NOT across paradigms is algebraically well-defined.

Read Paper →

Graph-Posting List Isomorphism

Graph traversals and pattern matches map to posting lists via a structure-preserving isomorphism, integrating seamlessly with text and vector operations.

Read Paper →

Bayesian BM25

BM25 scores transform to calibrated probabilities via Bayesian inference, enabling principled multi-signal fusion without ad-hoc normalization.

Read Paper →

Neural Emergence

Neural network structure emerges analytically from multi-signal Bayesian inference. Activation functions are derived as answers to probabilistic questions.

Read Paper →

Vector Calibration

Vector similarity scores transform to calibrated relevance probabilities via likelihood ratios over ANN index statistics, completing the probabilistic unification of sparse and dense retrieval.

Read Paper →

Read the full theoretical foundations →

Architecture

graph TD
    SQL["SQL Parser — pglast"] --> Compiler[SQL Compiler]
    QB["QueryBuilder — Fluent API"] --> Operators

    Compiler --> Optimizer[Query Optimizer]
    Optimizer --> Operators[Operator Tree]
    Operators --> Executor[Plan Executor]
    Executor --> PAR["Parallel Executor — ThreadPool"]
    Operators --> Cypher["Cypher Compiler — openCypher"]

    PAR --> DS["Document Store — SQLite"]
    PAR --> II["Inverted Index — SQLite + Analyzer"]
    PAR --> VI["Vector Index — IVF"]
    PAR --> GS["Graph Store — Named Graphs + SQLite"]

    subgraph Scoring
        BM25[BM25]
        BBFS[Bayesian BM25]
        VS[Vector Scorer]
    end

    subgraph Fusion
        LO[Log-Odds]
        PB[Probabilistic Boolean]
        ATT[Attention]
        DF[Deep Fusion]
    end

    Operators --> Scoring
    Operators --> Fusion

SQL Frontend

PostgreSQL 17 compatible SQL parsed by pglast. Supports DDL, DML, JOINs, CTEs, window functions, subqueries, set operations.

Operator Algebra

Composable operators over posting lists. Boolean, primitive, scoring, fusion, graph, and aggregation operators chain freely.

Volcano Execution

Iterator-based physical engine with Apache Arrow columnar batches. External merge sort, Grace hash aggregation, disk spilling.

Storage Layer

SQLite-backed persistence for documents, inverted indexes, vectors (IVF), and graph adjacency. Per-table isolation, ACID transactions.

Text Analysis

Pluggable pipeline: CharFilter, Tokenizer, TokenFilter, Analyzer. Built-in standard tokenizer, lowercase, stop words, Porter stemming.

Parallel Execution

Independent branches (Union, Intersect, Fusion) run concurrently via ThreadPoolExecutor. Configurable worker count.

Named Graphs

Per-graph partitioned adjacency via _GraphPartition. Vertices/edges shared globally; adjacency indexes per-graph. Graph algebra (union, intersect, difference), property indexes, and all SQL functions use direct graph names.

Deep Fusion

Multi-layer neural network as SQL: ResNet (signal layers), GNN (propagation), CNN (convolution), pooling, dense layers, flatten, softmax, batch normalization, and dropout — all composed as deep_fusion() layer functions.

Deep Learning

Analytical CNN training via deep_learn() — no backpropagation. Multi-channel conv with configurable kernel initialization (Kaiming, orthogonal, Gabor, k-means), global pooling (global_pool()), self-attention (context-dependent PoE, Theorem 8.3), and ridge regression (posterior). PyTorch GPU acceleration (MPS/CUDA). Inference via deep_predict() or deep_fusion(model()).

Cross-Paradigm Operators

Each paradigm maps into the posting list space through primitive operators. Because every operator produces a posting list, they compose freely.

Operator	Definition	Paradigm
$T(t)$	$PL(\lbrace d \in \mathcal{D} \mid t \in term(d, f) \rbrace)$	Text retrieval
$V_\theta(q)$	$PL(\lbrace d \in \mathcal{D} \mid sim(vec(d, f), q) \geq \theta \rbrace)$	Vector search
$KNN_k(q)$	$PL(D_k)$ where $\|D_k\| = k$, ranked by similarity	Vector search
$Filter_{f,v}(L)$	$L \cap PL(\lbrace d \in \mathcal{D} \mid d.f = v \rbrace)$	Relational
$Score_q(L)$	$(L, [s_1, \ldots, s_{\|L\|}])$	Scoring
$\Phi(L_G)$	$PL\bigl(\bigcup_i \phi_{G \rightarrow D}(G_i)\bigr)$	Graph

A hybrid text + vector search is simply an intersection:

Hybrid_{t,q,\theta} = T(t) \cap V_\theta(q)

Fusion Functions

Multi-signal fusion combines scores from different paradigms. Log-odds conjunction resolves the conjunction shrinkage problem where naive probability multiplication underestimates relevance:

Function	Description
`fuse_log_odds(s1, s2, ..., gating => 'swish')`	Log-odds conjunction with optional gating and gating_beta
`fuse_prob_and(s1, s2, ...)`	$P = \prod P_i$
`fuse_prob_or(s1, s2, ...)`	$P = 1 - \prod(1 - P_i)$
`fuse_prob_not(signal)`	$P = 1 - P_{signal}$
`fuse_attention(s1, s2, ..., normalized => true)`	Attention-weighted fusion with logit normalization
`fuse_multihead(s1, s2, ..., n_heads => 4)`	Multi-head attention fusion (Remark 8.6)
`fuse_learned(s1, s2, ..., alpha => 0.7)`	Learned per-signal weights via Hebbian gradient
`pagerank(['graph']) / hits(['graph']) / betweenness(['graph'])`	Graph centrality as FROM or WHERE signal (named graph optional)
`progressive_fusion(s1, s2, k, ...)`	Cascading multi-stage WAND pruning
`weighted_rpq('expr', start, 'prop')`	RPQ with cumulative edge weight tracking
`deep_fusion(layer(...), convolve(...), pool(...), ...)`	Multi-layer neural network: ResNet + GNN + CNN + DenseNet

Deep Fusion Layer Functions

Used inside deep_fusion() to compose neural network pipelines over graph-structured data. Each layer type maps to a standard DL architecture:

Function	Description
`layer(sig1, sig2, ...)`	Signal layer: log-odds conjunction with residual connection (ResNet)
`propagate('label', 'agg'[, 'dir'])`	Graph propagation: spread scores through edges (GNN)
`convolve('label', ARRAY[w...][, 'dir'])`	Spatial convolution: weighted multi-hop BFS aggregation (CNN)
`pool('label', 'method', size[, 'dir'])`	Spatial downsampling via greedy BFS partitioning
`dense(ARRAY[W], ARRAY[b], output_channels => N, input_channels => M)`	Fully connected layer: `out = W @ input + bias`
`flatten()`	Collapse spatial nodes into a single vector
`softmax()`	Classification head (numerically stable)
`batch_norm([epsilon => 1e-5])`	Per-channel normalization across nodes
`dropout(p)`	Inference-mode scaling by $(1 - p)$
`model('name', $1)`	Load trained model and create full inference pipeline
`embed(vector, in_channels => C, grid_h => H, grid_w => W)`	Inject raw embedding vector into channel map

Deep Learning Functions

Train CNN classifiers analytically (no backpropagation) and run inference via SQL aggregate and scalar functions:

Function	Description
`deep_learn('model', label, embedding, 'edge_label', layers..., gating => 'relu', lambda => 1.0)`	SELECT aggregate: analytical CNN training via ridge regression + random multi-channel conv
`deep_predict('model', embedding)`	Per-row scalar: inference with trained model, returns class probabilities
`build_grid_graph('table', rows, cols, 'label')`	FROM-clause: construct 4-connected grid graph for spatial convolution

Layer functions available inside deep_learn() and deep_fusion():

Layer	Description
`convolve(n_channels => N, init => 'kaiming')`	Multi-channel spatial convolution. Init modes: kaiming (default), orthogonal, gabor, kmeans
`pool('max', 2)`	Spatial downsampling (max or avg pooling)
`global_pool('avg'\|'max'\|'avg_max')`	Channel-preserving spatial reduction: avg (per-channel mean), max, avg_max (both concatenated)
`attention(n_heads => 4, mode => 'random_qk')`	Self-attention: context-dependent PoE (Section 8). Modes: content, random_qk, learned_v
`flatten()`, `dense(N)`, `softmax()`	Classification head: flatten spatial features, dense projection, softmax

Query Optimizer

DPccp Join Ordering (Moerkotte & Neumann, 2006)

The join order optimizer uses the DPccp algorithm — dynamic programming over connected subgraph complement pairs. It finds the optimal bushy join tree in $O(3^n)$ time, compared to $O(n!)$ for exhaustive enumeration. For queries with more than 16 relations, it falls back to a greedy $O(n^3)$ heuristic.

Bitmask DP Table

Relation subsets encoded as integer bitmasks for O(1) hash lookup and set operations, avoiding frozenset overhead in the inner loop.

Bytearray Connectivity

O(1) array-indexed connectivity check via bytearray(2^n) lookup table, replacing hash-based set membership tests.

Incremental Enumeration

Connected subgraphs built bottom-up via BFS extension with neighbor > min(S) invariant — each subgraph generated exactly once.

Canonical Submask Split

Enumeration of (S1, S2) splits uses the sub | lowest_bit trick to visit only the canonical half, halving the search space.

Cost-Based Optimization

Histogram-aware selectivity: equi-depth histograms + Most Common Values (MCV) from ANALYZE
Recursive filter pushdown: through nested IntersectOperators to all leaf operators
Vector threshold merge: same query vector with floating-point tolerance (np.allclose)
Cost-based intersection reordering: sort by execution cost, not cardinality
Intersection early termination: skip remaining operands when accumulator is empty
Predicate-aware cardinality damping: same-column correlation vs cross-column independence
Join-algorithm-aware DPccp: index join cost for small inputs, hash join cost for large inputs
Fusion signal reordering: cheapest-first signal evaluation
R*Tree spatial index scan: for POINT column range queries
B-tree index scan substitution: replace full scans when profitable
FDW predicate pushdown: comparison, IN, LIKE, ILIKE, BETWEEN pushed to DuckDB/Arrow Flight SQL
FDW full query pushdown: same-server queries delegated entirely via AST deparsing
FDW mixed foreign-local: small local tables shipped to DuckDB for in-process JOINs
Cross-paradigm cardinality: text, vector, graph, fusion, temporal, and GNN operators
Edge filter pushdown: graph edge property filters into pattern constraints
Join-pattern fusion: merge intersected pattern matches with shared variables
Cross-paradigm join cost models: text similarity, vector similarity, graph, hybrid joins
Threshold-aware vector selectivity: 4-tier estimation buckets
Temporal cardinality correction: timestamp/range selectivity for temporal graphs
Path index acceleration: for Concat-of-Labels RPQ expressions and Cypher MATCH patterns
CTE inlining: for single-reference non-recursive CTEs
Predicate pushdown: into views and derived tables
Implicit cross join reordering: via DPccp when equijoin predicates exist in WHERE
Filter pushdown into traverse: vertex property filters absorbed into BFS pruning predicates
Graph-aware fusion reordering: graph operators get cost discount when graph statistics available
Named graph statistics: per-graph degree distribution, label degree, vertex label counts
Information-theoretic cardinality bounds: column entropy floors intersection estimates
Negation-aware pattern cost: negated edge patterns add scan overhead to cost model
RPQ expression simplification: algebraic simplification and NFA-to-DFA conversion for small automata

SQL Interface

PostgreSQL 17 compatible SQL with cross-paradigm extensions. 95 of 103 target PostgreSQL 17 features implemented.

Multi-Signal Fusion

SELECT title, _score FROM papers
WHERE fuse_attention(
    text_match(title, 'attention'),
    bayesian_knn_match(embedding, $1, 10, method => 'auto'),
    traverse_match(1, 'cited_by', 2),
    normalized => true
) AND year >= 2020
ORDER BY _score DESC;

Window Functions

SELECT rep, sale_date, amount,
       SUM(amount) OVER (ORDER BY sale_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM sales;

Recursive CTE

WITH RECURSIVE org_tree AS (
    SELECT id, name, 1 AS depth FROM org_chart WHERE manager_id IS NULL
    UNION ALL
    SELECT o.id, o.name, t.depth + 1
    FROM org_chart o INNER JOIN org_tree t ON o.manager_id = t.id
)
SELECT name, depth FROM org_tree ORDER BY depth;

Graph Query (openCypher / Apache AGE)

— Create a named graph (persisted in SQLite)
SELECT * FROM create_graph('social');

— Populate the graph with Cypher CREATE
SELECT * FROM cypher('social', $$
    CREATE (a:Person {name: 'Alice', age: 30})
    CREATE (b:Person {name: 'Bob', age: 28})
    CREATE (a)-[:KNOWS]->(b)
$$) AS (v agtype);

— Query the named graph
SELECT * FROM cypher('social', $$
    MATCH (p:Person)-[:KNOWS]->(friend:Person)
    WHERE p.age > 25
    RETURN p.name, friend.name, p.age
    ORDER BY p.name
$$) AS (name agtype, friend agtype, age agtype);

— Drop the graph and all its data
SELECT * FROM drop_graph('social');

Named graphs provide isolated graph namespaces with full SQLite persistence. Each graph stores vertices and edges in dedicated tables (_graph_vertices_*, _graph_edges_*), surviving engine restarts. Graph centrality functions (pagerank, hits, betweenness) and path indexes operate on named graphs using direct graph names (e.g., traverse(1, 'knows', 2, 'social')).

Full-Text Search (@@ Operator)

— Query string mini-language: boolean, phrase, field targeting
SELECT title, _score FROM articles
WHERE title @@ 'database AND query' ORDER BY _score DESC;

— Hybrid text + vector fusion via @@
SELECT title, _score FROM articles
WHERE _all @@ 'body:search AND embedding:[0.1, 0.9, 0.0, 0.0]'
ORDER BY _score DESC;

Geospatial Query

CREATE TABLE restaurants (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    location POINT
);

CREATE INDEX idx_loc ON restaurants USING rtree (location);

SELECT name, ROUND(ST_Distance(location, POINT(-73.9857, 40.7484)), 0) AS dist_m
FROM restaurants
WHERE spatial_within(location, POINT(-73.9857, 40.7484), 5000)
ORDER BY dist_m;

Deep Fusion Neural Network

— Full CNN pipeline expressed as SQL
SELECT id, _score FROM patches
WHERE deep_fusion(
    layer(knn_match(embedding, $1, 16)),
    convolve('spatial', ARRAY[0.6, 0.4]),
    pool('spatial', 'max', 2),
    flatten(),
    dense(ARRAY[...], ARRAY[...],
          output_channels => 4, input_channels => 8),
    softmax(),
    gating => 'relu'
) ORDER BY _score DESC;

Each SQL function call maps to a neural network layer. The query optimizer produces an EXPLAIN plan that reads like a network architecture diagram. Signal layers are ResNet (residual connections), propagate layers are GNN (message passing), convolve layers are CNN (spatial aggregation), and the dense-softmax head is a standard classifier.

Deep Learning: Training + Inference

-- Train a CNN classifier analytically (no backpropagation)
SELECT deep_learn(
    'mnist_cnn', label, embedding, 'spatial',
    convolve(n_channels => 32),
    pool('max', 2),
    convolve(n_channels => 64),
    pool('max', 2),
    flatten(),
    dense(output_channels => 10),
    softmax(),
    gating => 'relu', lambda => 1.0
) FROM mnist_train;

-- Inference: per-row scalar function
SELECT id, deep_predict('mnist_cnn', embedding) AS pred
FROM test_data;

-- Inference: via deep_fusion pipeline with learned weights
SELECT id, _score, class_probs FROM grid_28x28
WHERE deep_fusion(
    model('mnist_cnn', $1),
    gating => 'relu'
) ORDER BY _score DESC;

Training is analytical: multi-channel convolution kernels serve as Bayesian prior (Kaiming, orthogonal, Gabor, or k-means initialization), ridge regression computes the posterior. No gradient descent, no learning rate, no epochs. global_pool() provides channel-preserving spatial reduction as an alternative to flatten(). Self-attention layers inject global context between conv+pool stages via context-dependent PoE (Paper 4, Section 8). MNIST achieves 97.89% test accuracy.

Temporal Graph Traversal

— Edges valid at timestamp
SELECT * FROM temporal_traverse(1, 'knows', 2, 1700000000);

— GNN message passing: aggregate neighbor properties
SELECT * FROM papers
WHERE message_passing(2, 'mean', 'citations');

— Structural graph embeddings
SELECT * FROM papers
WHERE graph_embedding(16, 3);

Graph Centrality

SELECT title, _score FROM pagerank()
ORDER BY _score DESC;

— Named graph in WHERE signal and fusion
SELECT title, _score FROM papers
WHERE pagerank('social') ORDER BY _score DESC;

SELECT title, _score FROM papers
WHERE fuse_log_odds(
    text_match(title, 'attention'),
    pagerank('social'),
    gating => 'relu'
) ORDER BY _score DESC;

Foreign Data Wrapper

CREATE SERVER warehouse FOREIGN DATA WRAPPER duckdb_fdw;

CREATE FOREIGN TABLE sales (
    id INTEGER, name TEXT, amount INTEGER,
    year INTEGER, month INTEGER
) SERVER warehouse OPTIONS (
    source '/data/sales/**/*.parquet',
    hive_partitioning 'true'
);

— DuckDB prunes Hive partitions at source
SELECT name, SUM(amount) FROM sales
WHERE year IN (2024, 2025) AND month > 6
GROUP BY name ORDER BY SUM(amount) DESC;

Foreign Data Wrappers

UQA federates external data sources through the FDW abstraction, bridging pyarrow.Table results into the posting list pipeline.

DuckDB FDW

In-process access to Parquet, CSV, JSON, and S3 objects. Hive partition discovery with predicate pushdown for partition pruning.

Arrow Flight SQL FDW

Remote access to any Arrow Flight SQL server — Dremio, DataFusion, DuckDB, and other compatible engines over gRPC.

Full Query Pushdown

Entire queries — JOINs, aggregates, window functions, subqueries — delegated to DuckDB or Arrow Flight SQL via AST deparsing. Zero Python row materialization.

Mixed Foreign-Local

Local dimension tables shipped to DuckDB as temp tables for in-process JOIN execution with large foreign fact tables.

Predicate Pushdown

Comparison (=, !=, <, >), IN, LIKE, ILIKE, and BETWEEN predicates are pushed down to the external engine, enabling Hive partition pruning and reducing data transfer.

Full Query Pushdown

When all tables in a query live on the same DuckDB server, the entire SQL — including JOINs, GROUP BY, ORDER BY, window functions, subqueries, and CTEs — is deparsed via pglast.stream.RawStream and executed natively by DuckDB. For mixed foreign-local queries, small local tables (<100K rows) are shipped to DuckDB as temp tables. Arrow Flight SQL supports same-server pure-foreign pushdown. If DuckDB cannot execute (e.g., UQA-specific functions like spatial_within), the standard UQA operator pipeline takes over automatically.

Benchmarks

309 pytest-benchmark tests across 15 subsystems track performance across commits. The CI pipeline detects regressions with a 150% threshold and blocks PRs that exceed it.

Posting List

Union, intersect, complement operations at 1K–100K entries. Two-pointer merge and boolean algebra.

Storage

SQLite document store and inverted index: insert, bulk load, lookup, and range scan throughput.

SQL Compiler

Parse + compile latency for SELECT, JOIN, CTE, window functions, and complex WHERE clauses.

Execution Engine

Volcano physical operators: sequential scan, filter, project, hash aggregate, sort, and disk spilling.

Join Planner

DPccp enumeration for star and chain join graphs with 3–16 relations. Greedy fallback at 20–30 relations.

Scoring

BM25, Bayesian BM25, vector scoring, and multi-signal fusion across varying corpus sizes.

Graph

BFS traversal, pattern matching (MRV + arc consistency), regular path queries on social-graph topologies.

Graph Centrality

PageRank, HITS, betweenness, bounded RPQ, weighted paths, subgraph index, incremental matching, and progressive fusion.

Named Graphs

Graph lifecycle, vertex/edge insertion, scoped traversal, pattern matching, RPQ, graph algebra, and property indexes.

End-to-End SQL

Full query lifecycle: parse, optimize, execute, and materialize for analytical and hybrid queries.

View live benchmark dashboard →