A multi-paradigm database engine that unifies relational, text retrieval, vector search, graph query, and geospatial paradigms under a single algebraic structure, using posting lists as the universal abstraction.
UQA is built on five interconnected theoretical papers that establish a mathematical foundation spanning database theory, information retrieval, probabilistic inference, neural computation, and vector calibration.
Sorted sequences of $(id, payload)$ pairs form a complete Boolean algebra. Any combination of AND, OR, NOT across paradigms is algebraically well-defined.
Graph traversals and pattern matches map to posting lists via a structure-preserving isomorphism, integrating seamlessly with text and vector operations.
BM25 scores transform to calibrated probabilities via Bayesian inference, enabling principled multi-signal fusion without ad-hoc normalization.
Neural network structure emerges analytically from multi-signal Bayesian inference. Activation functions are derived as answers to probabilistic questions.
Vector similarity scores transform to calibrated relevance probabilities via likelihood ratios over ANN index statistics, completing the probabilistic unification of sparse and dense retrieval.
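To make the universal abstraction concrete, here is a minimal sketch (illustrative only, not UQA's implementation) of the AND operator as a two-pointer merge over sorted posting lists:

```python
def intersect(a, b):
    """AND over two posting lists: sorted sequences of (id, payload) pairs.

    Two-pointer merge runs in O(|a| + |b|); payload handling is
    simplified here to keeping the left-hand payload.
    """
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:       # same document id: keep it
            out.append(a[i])
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:      # advance whichever side is behind
            i += 1
        else:
            j += 1
    return out

# Text hits AND vector hits: documents matching both paradigms
text_hits = [(1, 0.8), (4, 0.5), (9, 0.9)]
vector_hits = [(4, 0.7), (9, 0.6), (12, 0.4)]
print(intersect(text_hits, vector_hits))  # [(4, 0.5), (9, 0.9)]
```

Union and complement follow the same merge pattern, which is what keeps the Boolean algebra closed over posting lists.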
graph TD
SQL["SQL Parser — pglast"] --> Compiler[SQL Compiler]
QB["QueryBuilder — Fluent API"] --> Operators
Compiler --> Optimizer[Query Optimizer]
Optimizer --> Operators[Operator Tree]
Operators --> Executor[Plan Executor]
Executor --> PAR["Parallel Executor — ThreadPool"]
Operators --> Cypher["Cypher Compiler — openCypher"]
PAR --> DS["Document Store — SQLite"]
PAR --> II["Inverted Index — SQLite + Analyzer"]
PAR --> VI["Vector Index — IVF"]
PAR --> GS["Graph Store — Named Graphs + SQLite"]
subgraph Scoring
BM25[BM25]
BBFS[Bayesian BM25]
VS[Vector Scorer]
end
subgraph Fusion
LO[Log-Odds]
PB[Probabilistic Boolean]
ATT[Attention]
DF[Deep Fusion]
end
Operators --> Scoring
Operators --> Fusion
PostgreSQL 17 compatible SQL parsed by pglast. Supports DDL, DML, JOINs, CTEs, window functions, subqueries, set operations.
Composable operators over posting lists. Boolean, primitive, scoring, fusion, graph, and aggregation operators chain freely.
Iterator-based physical engine with Apache Arrow columnar batches. External merge sort, Grace hash aggregation, disk spilling.
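As an illustrative sketch of that iterator model over Arrow batches (operator names here are hypothetical, not UQA's actual classes):

```python
import pyarrow as pa
import pyarrow.compute as pc

def seq_scan(table: pa.Table, batch_size: int = 1024):
    # Leaf operator: stream the table as columnar record batches
    yield from table.to_batches(max_chunksize=batch_size)

def filter_op(child, column: str, threshold):
    # Pull batches from the child; keep rows where column > threshold
    for batch in child:
        mask = pc.greater(batch.column(column), threshold)
        yield batch.filter(mask)

t = pa.table({"year": [2019, 2021, 2024], "title": ["a", "b", "c"]})
for batch in filter_op(seq_scan(t), "year", 2020):
    print(batch.to_pydict())  # {'year': [2021, 2024], 'title': ['b', 'c']}
```

Each operator pulls one batch at a time from its child, which is what lets external merge sort and Grace hash aggregation spill to disk instead of materializing whole inputs.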
SQLite-backed persistence for documents, inverted indexes, vectors (IVF), and graph adjacency. Per-table isolation, ACID transactions.
Pluggable pipeline: CharFilter, Tokenizer, TokenFilter, Analyzer. Built-in standard tokenizer, lowercase, stop words, Porter stemming.
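A minimal sketch of such a pipeline (the stage names mirror the interfaces above; the composition itself is illustrative, not UQA's API):

```python
import re

def analyze(text, char_filters, tokenizer, token_filters):
    for cf in char_filters:
        text = cf(text)                              # CharFilter: normalize raw text
    tokens = tokenizer(text)                         # Tokenizer: split into tokens
    for tf in token_filters:
        tokens = [t for t in map(tf, tokens) if t]   # TokenFilter: transform or drop
    return tokens

# Standard tokenizer + lowercase + stop-word removal
stop = {"the", "a", "of"}
print(analyze(
    "The Art of Databases",
    char_filters=[str.strip],
    tokenizer=lambda s: re.findall(r"\w+", s),
    token_filters=[str.lower, lambda t: None if t in stop else t],
))  # ['art', 'databases']
```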
Independent branches (Union, Intersect, Fusion) run concurrently via ThreadPoolExecutor. Configurable worker count.
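A minimal sketch of branch-parallel execution (illustrative; the real executor also manages operator state and result merging):

```python
from concurrent.futures import ThreadPoolExecutor

def run_branches(branches, max_workers=4):
    # Each branch is a zero-argument callable producing a posting list;
    # independent branches evaluate concurrently, then the parent merges.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(branch) for branch in branches]
        return [f.result() for f in futures]

text_branch = lambda: [(1, 0.8), (4, 0.5)]    # e.g. a text_match signal
vector_branch = lambda: [(4, 0.7), (9, 0.6)]  # e.g. a knn_match signal
left, right = run_branches([text_branch, vector_branch])
```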
Per-graph partitioned adjacency via _GraphPartition.
Vertices/edges shared globally; adjacency indexes per-graph.
Graph algebra (union, intersect, difference), property indexes,
and all SQL functions use direct graph names.
Multi-layer neural network as SQL: ResNet (signal layers),
GNN (propagation), CNN (convolution), pooling, dense layers,
flatten, softmax, batch normalization, and dropout —
all composed as deep_fusion() layer functions.
Analytical CNN training via deep_learn() —
no backpropagation. Multi-channel conv with configurable kernel
initialization (Kaiming, orthogonal, Gabor, k-means),
global pooling (global_pool()),
self-attention (context-dependent PoE, Theorem 8.3),
and ridge regression (posterior). PyTorch GPU acceleration
(MPS/CUDA). Inference via deep_predict() or
deep_fusion(model()).
Each paradigm maps into the posting list space through primitive operators. Because every operator produces a posting list, they compose freely.
| Operator | Definition | Paradigm |
|---|---|---|
| $T(t)$ | $PL(\lbrace d \in \mathcal{D} \mid t \in term(d, f) \rbrace)$ | Text retrieval |
| $V_\theta(q)$ | $PL(\lbrace d \in \mathcal{D} \mid sim(vec(d, f), q) \geq \theta \rbrace)$ | Vector search |
| $KNN_k(q)$ | $PL(D_k)$ where $|D_k| = k$, ranked by similarity | Vector search |
| $Filter_{f,v}(L)$ | $L \cap PL(\lbrace d \in \mathcal{D} \mid d.f = v \rbrace)$ | Relational |
| $Score_q(L)$ | $(L, [s_1, \ldots, s_{|L|}])$ | Scoring |
| $\Phi(L_G)$ | $PL\bigl(\bigcup_i \phi_{G \rightarrow D}(G_i)\bigr)$ | Graph |
A hybrid text + vector search is simply an intersection of two primitive operators, $T(t) \cap V_\theta(q)$: both operands are posting lists, so the result is again a posting list that composes with any other operator.
Multi-signal fusion combines scores from different paradigms. Log-odds conjunction resolves the conjunction shrinkage problem where naive probability multiplication underestimates relevance:
| Function | Description |
|---|---|
| `fuse_log_odds(s1, s2, ..., gating => 'swish')` | Log-odds conjunction with optional gating and `gating_beta` |
| `fuse_prob_and(s1, s2, ...)` | $P = \prod P_i$ |
| `fuse_prob_or(s1, s2, ...)` | $P = 1 - \prod(1 - P_i)$ |
| `fuse_prob_not(signal)` | $P = 1 - P_{signal}$ |
| `fuse_attention(s1, s2, ..., normalized => true)` | Attention-weighted fusion with logit normalization |
| `fuse_multihead(s1, s2, ..., n_heads => 4)` | Multi-head attention fusion (Remark 8.6) |
| `fuse_learned(s1, s2, ..., alpha => 0.7)` | Learned per-signal weights via Hebbian gradient |
| `pagerank(['graph'])` / `hits(['graph'])` / `betweenness(['graph'])` | Graph centrality as FROM or WHERE signal (named graph optional) |
| `progressive_fusion(s1, s2, k, ...)` | Cascading multi-stage WAND pruning |
| `weighted_rpq('expr', start, 'prop')` | RPQ with cumulative edge weight tracking |
| `deep_fusion(layer(...), convolve(...), pool(...), ...)` | Multi-layer neural network: ResNet + GNN + CNN + DenseNet |
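As a sketch of why log-odds conjunction helps (assuming independent signals; the gated variants refine this further): naive conjunction $P = \prod_i P_i$ can only shrink, e.g. two strong signals at $P_1 = P_2 = 0.9$ yield $0.81$, below either input. Summing in log-odds space lets agreeing signals reinforce instead:

$$\operatorname{logit}(P_i) = \log \frac{P_i}{1 - P_i}, \qquad P = \sigma\Bigl(\sum_{i=1}^{n} \operatorname{logit}(P_i)\Bigr)$$

where $\sigma$ is the logistic function; for $P_1 = P_2 = 0.9$ this gives $\sigma(2 \ln 9) \approx 0.988$.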
Used inside deep_fusion() to compose neural network pipelines
over graph-structured data. Each layer type maps to a standard DL architecture:
| Function | Description |
|---|---|
| `layer(sig1, sig2, ...)` | Signal layer: log-odds conjunction with residual connection (ResNet) |
| `propagate('label', 'agg'[, 'dir'])` | Graph propagation: spread scores through edges (GNN) |
| `convolve('label', ARRAY[w...][, 'dir'])` | Spatial convolution: weighted multi-hop BFS aggregation (CNN) |
| `pool('label', 'method', size[, 'dir'])` | Spatial downsampling via greedy BFS partitioning |
| `dense(ARRAY[W], ARRAY[b], output_channels => N, input_channels => M)` | Fully connected layer: `out = W @ input + bias` |
| `flatten()` | Collapse spatial nodes into a single vector |
| `softmax()` | Classification head (numerically stable) |
| `batch_norm([epsilon => 1e-5])` | Per-channel normalization across nodes |
| `dropout(p)` | Inference-mode scaling by $(1 - p)$ |
| `model('name', $1)` | Load trained model and create full inference pipeline |
| `embed(vector, in_channels => C, grid_h => H, grid_w => W)` | Inject raw embedding vector into channel map |
Train CNN classifiers analytically (no backpropagation) and run inference via SQL aggregate and scalar functions:
| Function | Description |
|---|---|
| `deep_learn('model', label, embedding, 'edge_label', layers..., gating => 'relu', lambda => 1.0)` | SELECT aggregate: analytical CNN training via ridge regression + random multi-channel conv |
| `deep_predict('model', embedding)` | Per-row scalar: inference with trained model, returns class probabilities |
| `build_grid_graph('table', rows, cols, 'label')` | FROM-clause: construct 4-connected grid graph for spatial convolution |
Layer functions available inside deep_learn() and deep_fusion():
| Layer | Description |
|---|---|
| `convolve(n_channels => N, init => 'kaiming')` | Multi-channel spatial convolution. Init modes: `kaiming` (default), `orthogonal`, `gabor`, `kmeans` |
| `pool('max', 2)` | Spatial downsampling (max or avg pooling) |
| `global_pool(mode)` | Channel-preserving spatial reduction; `mode` is `'avg'` (per-channel mean), `'max'`, or `'avg_max'` (both concatenated) |
| `attention(n_heads => 4, mode => 'random_qk')` | Self-attention: context-dependent PoE (Section 8). Modes: `content`, `random_qk`, `learned_v` |
| `flatten()`, `dense(N)`, `softmax()` | Classification head: flatten spatial features, dense projection, softmax |
The join order optimizer uses the DPccp algorithm — dynamic programming over connected subgraph complement pairs. It finds the optimal bushy join tree in $O(3^n)$ time, compared to $O(n!)$ for exhaustive enumeration. For queries with more than 16 relations, it falls back to a greedy $O(n^3)$ heuristic.
Relation subsets encoded as integer bitmasks for O(1) hash lookup and set operations, avoiding `frozenset` overhead in the inner loop.
O(1) array-indexed connectivity check via a `bytearray(2^n)` lookup table, replacing hash-based set membership tests.
Connected subgraphs built bottom-up via BFS extension with a `neighbor > min(S)` invariant, so each subgraph is generated exactly once.
Enumeration of (S1, S2) splits uses the `sub | lowest_bit` trick to visit only the canonical half, halving the search space (see the sketch below).
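A minimal sketch of that split enumeration (illustrative; the real optimizer additionally checks connectivity of both halves before recursing):

```python
def enumerate_splits(S: int):
    """Yield each unordered (S1, S2) split of bitmask S exactly once.

    Forcing the lowest set bit of S into S1 visits only the canonical
    half of the 2^|S| subsets, halving the search space.
    """
    lowest = S & -S                    # isolate the lowest set bit
    rest = S ^ lowest                  # the remaining relations
    sub = 0
    while True:
        s1 = sub | lowest              # canonical side always holds lowest bit
        s2 = S ^ s1                    # complement within S
        if s2:                         # skip the trivial split where s1 == S
            yield s1, s2
        if sub == rest:
            break
        sub = (sub - rest) & rest      # advance to the next subset of rest

# For S = 0b111: (0b001, 0b110), (0b011, 0b100), (0b101, 0b010)
print([(bin(a), bin(b)) for a, b in enumerate_splits(0b111)])
```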
PostgreSQL 17 compatible SQL with cross-paradigm extensions. 95 of 103 target PostgreSQL 17 features implemented.
SELECT title, _score FROM papers
WHERE fuse_attention(
text_match(title, 'attention'),
bayesian_knn_match(embedding, $1, 10, method => 'auto'),
traverse_match(1, 'cited_by', 2),
normalized => true
) AND year >= 2020
ORDER BY _score DESC;
SELECT rep, sale_date, amount,
SUM(amount) OVER (ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM sales;
WITH RECURSIVE org_tree AS (
SELECT id, name, 1 AS depth FROM org_chart WHERE manager_id IS NULL
UNION ALL
SELECT o.id, o.name, t.depth + 1
FROM org_chart o INNER JOIN org_tree t ON o.manager_id = t.id
)
SELECT name, depth FROM org_tree ORDER BY depth;
-- Create a named graph (persisted in SQLite)
SELECT * FROM create_graph('social');
-- Populate the graph with Cypher CREATE
SELECT * FROM cypher('social', $$
CREATE (a:Person {name: 'Alice', age: 30})
CREATE (b:Person {name: 'Bob', age: 28})
CREATE (a)-[:KNOWS]->(b)
$$) AS (v agtype);
-- Query the named graph
SELECT * FROM cypher('social', $$
MATCH (p:Person)-[:KNOWS]->(friend:Person)
WHERE p.age > 25
RETURN p.name, friend.name, p.age
ORDER BY p.name
$$) AS (name agtype, friend agtype, age agtype);
-- Drop the graph and all its data
SELECT * FROM drop_graph('social');
Named graphs provide isolated graph namespaces with full SQLite persistence.
Each graph stores vertices and edges in dedicated tables
(_graph_vertices_*, _graph_edges_*), surviving
engine restarts. Graph centrality functions (pagerank,
hits, betweenness) and path indexes operate on
named graphs using direct graph names (e.g., traverse(1, 'knows', 2, 'social')).
-- Query string mini-language: boolean, phrase, field targeting
SELECT title, _score FROM articles
WHERE title @@ 'database AND query' ORDER BY _score DESC;
-- Hybrid text + vector fusion via @@
SELECT title, _score FROM articles
WHERE _all @@ 'body:search AND embedding:[0.1, 0.9, 0.0, 0.0]'
ORDER BY _score DESC;
CREATE TABLE restaurants (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
location POINT
);
CREATE INDEX idx_loc ON restaurants USING rtree (location);
SELECT name, ROUND(ST_Distance(location, POINT(-73.9857, 40.7484)), 0) AS dist_m
FROM restaurants
WHERE spatial_within(location, POINT(-73.9857, 40.7484), 5000)
ORDER BY dist_m;
-- Full CNN pipeline expressed as SQL
SELECT id, _score FROM patches
WHERE deep_fusion(
layer(knn_match(embedding, $1, 16)),
convolve('spatial', ARRAY[0.6, 0.4]),
pool('spatial', 'max', 2),
flatten(),
dense(ARRAY[...], ARRAY[...],
output_channels => 4, input_channels => 8),
softmax(),
gating => 'relu'
) ORDER BY _score DESC;
Each SQL function call maps to a neural network layer. The query optimizer produces an EXPLAIN plan that reads like a network architecture diagram. Signal layers are ResNet (residual connections), propagate layers are GNN (message passing), convolve layers are CNN (spatial aggregation), and the dense-softmax head is a standard classifier.
-- Train a CNN classifier analytically (no backpropagation)
SELECT deep_learn(
'mnist_cnn', label, embedding, 'spatial',
convolve(n_channels => 32),
pool('max', 2),
convolve(n_channels => 64),
pool('max', 2),
flatten(),
dense(output_channels => 10),
softmax(),
gating => 'relu', lambda => 1.0
) FROM mnist_train;
-- Inference: per-row scalar function
SELECT id, deep_predict('mnist_cnn', embedding) AS pred
FROM test_data;
-- Inference: via deep_fusion pipeline with learned weights
SELECT id, _score, class_probs FROM grid_28x28
WHERE deep_fusion(
model('mnist_cnn', $1),
gating => 'relu'
) ORDER BY _score DESC;
Training is analytical: multi-channel convolution kernels serve as a Bayesian prior
(Kaiming, orthogonal, Gabor, or k-means initialization), and ridge regression computes
the posterior. No gradient descent, no learning rate, no epochs.
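As a sketch of that posterior step (closed-form ridge regression is standard; the papers' full Bayesian treatment is not reproduced here), with flattened conv features $\Phi$ and one-hot labels $Y$:

$$W^{*} = (\Phi^{\top}\Phi + \lambda I)^{-1}\,\Phi^{\top} Y$$

where $\lambda$ corresponds to the `lambda => 1.0` argument above. A single linear solve replaces the entire gradient-descent loop.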
global_pool() provides channel-preserving spatial reduction as an
alternative to flatten(). Self-attention layers inject global context
between conv+pool stages via context-dependent PoE (Paper 4, Section 8).
MNIST achieves 97.89% test accuracy.
-- Edges valid at timestamp
SELECT * FROM temporal_traverse(1, 'knows', 2, 1700000000);
-- GNN message passing: aggregate neighbor properties
SELECT * FROM papers
WHERE message_passing(2, 'mean', 'citations');
-- Structural graph embeddings
SELECT * FROM papers
WHERE graph_embedding(16, 3);
SELECT title, _score FROM pagerank()
ORDER BY _score DESC;
-- Named graph in WHERE signal and fusion
SELECT title, _score FROM papers
WHERE pagerank('social') ORDER BY _score DESC;
SELECT title, _score FROM papers
WHERE fuse_log_odds(
text_match(title, 'attention'),
pagerank('social'),
gating => 'relu'
) ORDER BY _score DESC;
CREATE SERVER warehouse FOREIGN DATA WRAPPER duckdb_fdw;
CREATE FOREIGN TABLE sales (
id INTEGER, name TEXT, amount INTEGER,
year INTEGER, month INTEGER
) SERVER warehouse OPTIONS (
source '/data/sales/**/*.parquet',
hive_partitioning 'true'
);
-- DuckDB prunes Hive partitions at source
SELECT name, SUM(amount) FROM sales
WHERE year IN (2024, 2025) AND month > 6
GROUP BY name ORDER BY SUM(amount) DESC;
UQA federates external data sources through the FDW abstraction, bridging
pyarrow.Table results into the posting list pipeline.
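An illustrative sketch of that bridge (`table_to_posting_list` is a hypothetical helper, not UQA's API): project the foreign result's id column into a sorted posting list so downstream operators can intersect or fuse it.

```python
import pyarrow as pa

def table_to_posting_list(table: pa.Table, id_col: str = "id"):
    # Hypothetical bridge: sorted (id, payload) pairs from an Arrow result,
    # ready for intersection/fusion with native posting lists.
    ids = table.column(id_col).to_pylist()
    return sorted((doc_id, None) for doc_id in ids)

t = pa.table({"id": [9, 1, 4], "amount": [10, 20, 30]})
print(table_to_posting_list(t))  # [(1, None), (4, None), (9, None)]
```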
In-process access to Parquet, CSV, JSON, and S3 objects. Hive partition discovery with predicate pushdown for partition pruning.
Remote access to any Arrow Flight SQL server — Dremio, DataFusion, DuckDB, and other compatible engines over gRPC.
Entire queries — JOINs, aggregates, window functions, subqueries — delegated to DuckDB or Arrow Flight SQL via AST deparsing. Zero Python row materialization.
Local dimension tables shipped to DuckDB as temp tables for in-process JOIN execution with large foreign fact tables.
Comparison (=, !=, <, >),
IN, LIKE, ILIKE, and BETWEEN
predicates are pushed down to the external engine, enabling Hive partition
pruning and reducing data transfer.
When all tables in a query live on the same DuckDB server, the entire SQL
— including JOINs, GROUP BY, ORDER BY, window functions, subqueries,
and CTEs — is deparsed via pglast.stream.RawStream and
executed natively by DuckDB. For mixed foreign-local queries, small local
tables (<100K rows) are shipped to DuckDB as temp tables. Arrow Flight
SQL supports same-server pure-foreign pushdown. If DuckDB cannot execute
(e.g., UQA-specific functions like spatial_within), the
standard UQA operator pipeline takes over automatically.
309 pytest-benchmark tests spanning 15 subsystems track performance across commits. The CI pipeline flags regressions above a 150% threshold and blocks PRs that exceed it.
Union, intersect, complement operations at 1K–100K entries. Two-pointer merge and boolean algebra.
SQLite document store and inverted index: insert, bulk load, lookup, and range scan throughput.
Parse + compile latency for SELECT, JOIN, CTE, window functions, and complex WHERE clauses.
Volcano physical operators: sequential scan, filter, project, hash aggregate, sort, and disk spilling.
DPccp enumeration for star and chain join graphs with 3–16 relations. Greedy fallback at 20–30 relations.
BM25, Bayesian BM25, vector scoring, and multi-signal fusion across varying corpus sizes.
BFS traversal, pattern matching (MRV + arc consistency), regular path queries on social-graph topologies.
PageRank, HITS, betweenness, bounded RPQ, weighted paths, subgraph index, incremental matching, and progressive fusion.
Graph lifecycle, vertex/edge insertion, scoped traversal, pattern matching, RPQ, graph algebra, and property indexes.
Full query lifecycle: parse, optimize, execute, and materialize for analytical and hybrid queries.