A multi-paradigm database engine that unifies relational, text retrieval, vector search, graph query, and geospatial paradigms under a single algebraic structure, using posting lists as the universal abstraction.
UQA is built on four interconnected theoretical papers that establish a mathematical foundation spanning database theory, information retrieval, probabilistic inference, and neural computation.
Sorted sequences of $(id, payload)$ pairs form a complete Boolean algebra. Any combination of AND, OR, NOT across paradigms is algebraically well-defined.
Graph traversals and pattern matches map to posting lists via a structure-preserving isomorphism, integrating seamlessly with text and vector operations.
BM25 scores transform to calibrated probabilities via Bayesian inference, enabling principled multi-signal fusion without ad-hoc normalization.
Neural network structure emerges analytically from multi-signal Bayesian inference. Activation functions are derived as answers to probabilistic questions.
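The Boolean-algebra claim can be illustrated with a minimal sketch — plain Python lists of sorted ids standing in for UQA's posting lists; the function names here are illustrative, not the engine's API:

```python
def pl_and(a, b):
    """Two-pointer intersection of two sorted id lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def pl_or(a, b):
    """Sorted merge (union) of two sorted id lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:]); out.extend(b[j:])
    return out

def pl_not(a, universe):
    """Complement relative to the full sorted document-id universe."""
    s = set(a)
    return [d for d in universe if d not in s]

# Any cross-paradigm combination composes: (text AND vector) OR NOT(filter)
text, vec, filt, U = [1, 3, 5, 7], [3, 4, 5], [1, 2], list(range(1, 9))
print(pl_or(pl_and(text, vec), pl_not(filt, U)))  # [3, 4, 5, 6, 7, 8]
```

Because each function returns another sorted id list, the operators close over the same domain — which is what makes arbitrary nesting across paradigms well-defined.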
graph TD
SQL["SQL Parser — pglast"] --> Compiler[SQL Compiler]
QB["QueryBuilder — Fluent API"] --> Operators
Compiler --> Optimizer[Query Optimizer]
Optimizer --> Operators[Operator Tree]
Operators --> Executor[Plan Executor]
Executor --> PAR["Parallel Executor — ThreadPool"]
Operators --> Cypher["Cypher Compiler — openCypher"]
PAR --> DS["Document Store — SQLite"]
PAR --> II["Inverted Index — SQLite + Analyzer"]
PAR --> VI["Vector Index — IVF"]
PAR --> GS["Graph Store — SQLite"]
subgraph Scoring
BM25[BM25]
BBFS[Bayesian BM25]
VS[Vector Scorer]
end
subgraph Fusion
LO[Log-Odds]
PB[Probabilistic Boolean]
end
Operators --> Scoring
Operators --> Fusion
PostgreSQL 17 compatible SQL parsed by pglast. Supports DDL, DML, JOINs, CTEs, window functions, subqueries, set operations.
Composable operators over posting lists. Boolean, primitive, scoring, fusion, graph, and aggregation operators chain freely.
Iterator-based physical engine with Apache Arrow columnar batches. External merge sort, Grace hash aggregation, disk spilling.
SQLite-backed persistence for documents, inverted indexes, vectors (IVF), and graph adjacency. Per-table isolation, ACID transactions.
Pluggable pipeline: CharFilter, Tokenizer, TokenFilter, Analyzer. Built-in standard tokenizer, lowercase, stop words, Porter stemming.
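A minimal sketch of such a pluggable analyzer pipeline — the stage names mirror the description above, but the actual CharFilter/Tokenizer/TokenFilter interfaces may differ:

```python
import re

def standard_tokenizer(text):
    # Split on runs of non-word characters.
    return [t for t in re.split(r"\W+", text) if t]

def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def stopword_filter(tokens, stops=frozenset({"the", "a", "of"})):
    return [t for t in tokens if t not in stops]

def analyze(text, tokenizer=standard_tokenizer,
            filters=(lowercase_filter, stopword_filter)):
    """Run text through a tokenizer, then chain token filters in order."""
    tokens = tokenizer(text)
    for f in filters:
        tokens = f(tokens)
    return tokens

print(analyze("The Algebra of Posting Lists"))  # ['algebra', 'posting', 'lists']
```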
Independent branches (Union, Intersect, Fusion) run concurrently via ThreadPoolExecutor. Configurable worker count.
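Parallel branch evaluation can be sketched with the standard library — hypothetical branch callables standing in for operator subtrees:

```python
from concurrent.futures import ThreadPoolExecutor

def run_branches(branches, max_workers=4):
    """Evaluate independent plan branches concurrently.
    Results come back in submission order, ready for the parent
    Union/Intersect/Fusion operator to merge."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(b) for b in branches]
        return [f.result() for f in futures]

# Two independent inputs of a Union evaluated in parallel, then merged.
left, right = run_branches([lambda: [1, 3, 5], lambda: [2, 3, 6]])
print(sorted(set(left) | set(right)))  # [1, 2, 3, 5, 6]
```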
Isolated graph namespaces via create_graph() /
drop_graph() with SQLite persistence. PageRank, HITS,
betweenness centrality, and path indexes operate per named graph.
Each paradigm maps into the posting list space through primitive operators. Because every operator produces a posting list, they compose freely.
| Operator | Definition | Paradigm |
|---|---|---|
| $T(t)$ | $PL(\lbrace d \in \mathcal{D} \mid t \in term(d, f) \rbrace)$ | Text retrieval |
| $V_\theta(q)$ | $PL(\lbrace d \in \mathcal{D} \mid sim(vec(d, f), q) \geq \theta \rbrace)$ | Vector search |
| $KNN_k(q)$ | $PL(D_k)$ where $|D_k| = k$, ranked by similarity | Vector search |
| $Filter_{f,v}(L)$ | $L \cap PL(\lbrace d \in \mathcal{D} \mid d.f = v \rbrace)$ | Relational |
| $Score_q(L)$ | $(L, [s_1, \ldots, s_{|L|}])$ | Scoring |
| $\Phi(L_G)$ | $PL\bigl(\bigcup_i \phi_{G \rightarrow D}(G_i)\bigr)$ | Graph |
A hybrid text + vector search is simply an intersection:
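Using the primitive operators from the table above, for instance:

$$Hybrid(t, q, \theta) = T(t) \cap V_\theta(q)$$

Both operands are posting lists, so the hybrid query is just the Boolean AND of the algebra — no special-case bridging code between paradigms.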
Multi-signal fusion combines scores from different paradigms. Log-odds conjunction resolves the conjunction shrinkage problem where naive probability multiplication underestimates relevance:
| Function | Description |
|---|---|
| fuse_log_odds(s1, s2, ...) | Log-odds conjunction — calibrated multi-signal fusion |
| fuse_prob_and(s1, s2, ...) | $P = \prod P_i$ |
| fuse_prob_or(s1, s2, ...) | $P = 1 - \prod(1 - P_i)$ |
| fuse_prob_not(signal) | $P = 1 - P_{signal}$ |
| fuse_log_odds(s1, s2, 'relu') | Log-odds with ReLU/Swish gating |
| pagerank() / hits() / betweenness() | Graph centrality as FROM or WHERE signal |
| progressive_fusion(s1, s2, k, ...) | Cascading multi-stage WAND pruning |
| weighted_rpq('expr', start, 'prop') | RPQ with cumulative edge weight tracking |
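The log-odds conjunction described above can be sketched in a few lines — an illustration of the math (assuming calibrated probabilities in (0, 1) and a uniform 0.5 prior), not UQA's implementation:

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse_log_odds(*probs):
    """Sum evidence in log-odds space, then map back to a probability.
    Unlike naive multiplication, two agreeing strong signals *raise*
    confidence instead of shrinking it."""
    return sigmoid(sum(logit(p) for p in probs))

print(round(fuse_log_odds(0.9, 0.9), 4))  # 0.9878 — agreement strengthens
print(0.9 * 0.9)                          # 0.81   — naive AND shrinks
```

This is the conjunction shrinkage problem in one example: multiplying two 0.9 probabilities yields 0.81, below either input, while summing their log-odds yields ≈0.988.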
The join order optimizer uses the DPccp algorithm — dynamic programming over connected subgraph complement pairs. It finds the optimal bushy join tree in $O(3^n)$ time, compared to $O(n!)$ for exhaustive enumeration. For queries with more than 16 relations, it falls back to a greedy $O(n^3)$ heuristic.
Relation subsets encoded as integer bitmasks for O(1) hash lookup and set operations, avoiding frozenset overhead in the inner loop.
O(1) array-indexed connectivity check via bytearray(2^n)
lookup table, replacing hash-based set membership tests.
Connected subgraphs built bottom-up via BFS extension with
neighbor > min(S) invariant — each subgraph generated
exactly once.
Enumeration of (S1, S2) splits uses the sub | lowest_bit
trick to visit only the canonical half, halving the search space.
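The bitmask tricks above can be sketched as a self-contained illustration of subset-pair enumeration — not the optimizer itself, which additionally filters both halves for graph connectivity:

```python
def canonical_splits(S):
    """Enumerate unordered splits (S1, S2) of relation bitmask S into two
    non-empty halves. Only the half containing the lowest set bit of S is
    emitted as S1, so each unordered pair is visited exactly once,
    halving the search space. (DPccp would also check that both halves
    induce connected join subgraphs; that filter is omitted here.)"""
    lowest_bit = S & -S
    rest = S & ~lowest_bit
    if rest == 0:
        return                     # a single relation has no split
    sub = (rest - 1) & rest        # start below `rest` so S2 is never empty
    while True:
        s1 = sub | lowest_bit      # canonical half always keeps the lowest bit
        yield s1, S ^ s1           # XOR gives the complementary half
        if sub == 0:
            break
        sub = (sub - 1) & rest     # next submask of `rest`, descending

# Relations {0,1,2} as bitmask 0b111: three unordered splits
print(list(canonical_splits(0b111)))  # [(5, 2), (3, 4), (1, 6)]
```

For an $n$-relation mask this visits $2^{n-1} - 1$ pairs instead of $2^n - 2$ ordered splits, matching the halving claim above.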
PostgreSQL 17 compatible SQL with cross-paradigm extensions. 95 of 103 target PostgreSQL 17 features implemented.
SELECT title, _score FROM papers
WHERE fuse_log_odds(
text_match(title, 'attention'),
knn_match(embedding, ARRAY[0.1, 0.2, ...], 5),
traverse_match(1, 'cited_by', 2)
) AND year >= 2020
ORDER BY _score DESC;
SELECT rep, sale_date, amount,
SUM(amount) OVER (ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM sales;
WITH RECURSIVE org_tree AS (
SELECT id, name, 1 AS depth FROM org_chart WHERE manager_id IS NULL
UNION ALL
SELECT o.id, o.name, t.depth + 1
FROM org_chart o INNER JOIN org_tree t ON o.manager_id = t.id
)
SELECT name, depth FROM org_tree ORDER BY depth;
-- Create a named graph (persisted in SQLite)
SELECT * FROM create_graph('social');
-- Populate the graph with Cypher CREATE
SELECT * FROM cypher('social', $$
CREATE (a:Person {name: 'Alice', age: 30})
CREATE (b:Person {name: 'Bob', age: 28})
CREATE (a)-[:KNOWS]->(b)
$$) AS (v agtype);
-- Query the named graph
SELECT * FROM cypher('social', $$
MATCH (p:Person)-[:KNOWS]->(friend:Person)
WHERE p.age > 25
RETURN p.name, friend.name, p.age
ORDER BY p.name
$$) AS (name agtype, friend agtype, age agtype);
-- Drop the graph and all its data
SELECT * FROM drop_graph('social');
Named graphs provide isolated graph namespaces with full SQLite persistence.
Each graph stores vertices and edges in dedicated tables
(_graph_vertices_*, _graph_edges_*), surviving
engine restarts. Graph centrality functions (pagerank,
hits, betweenness) and path indexes operate on
named graphs via the 'graph:name' syntax.
-- Query string mini-language: boolean, phrase, field targeting
SELECT title, _score FROM articles
WHERE title @@ 'database AND query' ORDER BY _score DESC;
-- Hybrid text + vector fusion via @@
SELECT title, _score FROM articles
WHERE _all @@ 'body:search AND embedding:[0.1, 0.9, 0.0, 0.0]'
ORDER BY _score DESC;
CREATE TABLE restaurants (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
location POINT
);
CREATE INDEX idx_loc ON restaurants USING rtree (location);
SELECT name, ROUND(ST_Distance(location, POINT(-73.9857, 40.7484)), 0) AS dist_m
FROM restaurants
WHERE spatial_within(location, POINT(-73.9857, 40.7484), 5000)
ORDER BY dist_m;
-- Edges valid at timestamp
SELECT * FROM temporal_traverse(1, 'knows', 2, 1700000000);
-- GNN message passing: aggregate neighbor properties
SELECT * FROM papers
WHERE message_passing(2, 'mean', 'citations');
-- Structural graph embeddings
SELECT * FROM papers
WHERE graph_embedding(16, 3);
SELECT title, _score FROM pagerank()
ORDER BY _score DESC;
SELECT title, _score FROM papers
WHERE fuse_log_odds(
text_match(title, 'attention'),
pagerank(),
'relu'
) ORDER BY _score DESC;
CREATE SERVER warehouse FOREIGN DATA WRAPPER duckdb_fdw;
CREATE FOREIGN TABLE sales (
id INTEGER, name TEXT, amount INTEGER,
year INTEGER, month INTEGER
) SERVER warehouse OPTIONS (
source '/data/sales/**/*.parquet',
hive_partitioning 'true'
);
-- DuckDB prunes Hive partitions at source
SELECT name, SUM(amount) FROM sales
WHERE year IN (2024, 2025) AND month > 6
GROUP BY name ORDER BY SUM(amount) DESC;
UQA federates external data sources through the FDW abstraction, bridging
pyarrow.Table results into the posting list pipeline.
In-process access to Parquet, CSV, JSON, and S3 objects. Hive partition discovery with predicate pushdown for partition pruning.
Remote access to any Arrow Flight SQL server — Dremio, DataFusion, DuckDB, and other compatible engines over gRPC.
Entire queries — JOINs, aggregates, window functions, subqueries — delegated to DuckDB or Arrow Flight SQL via AST deparsing. Zero Python row materialization.
Local dimension tables shipped to DuckDB as temp tables for in-process JOIN execution with large foreign fact tables.
Comparison (=, !=, <, >),
IN, LIKE, ILIKE, and BETWEEN
predicates are pushed down to the external engine, enabling Hive partition
pruning and reducing data transfer.
When all tables in a query live on the same DuckDB server, the entire SQL
— including JOINs, GROUP BY, ORDER BY, window functions, subqueries,
and CTEs — is deparsed via pglast.stream.RawStream and
executed natively by DuckDB. For mixed foreign-local queries, small local
tables (<100K rows) are shipped to DuckDB as temp tables. Arrow Flight
SQL supports same-server pure-foreign pushdown. If DuckDB cannot execute
(e.g., UQA-specific functions like spatial_within), the
standard UQA operator pipeline takes over automatically.
295 pytest-benchmark tests across 14 subsystems track performance on every commit. The CI pipeline flags regressions above a 150% threshold and blocks PRs that exceed it.
Union, intersect, complement operations at 1K–100K entries. Two-pointer merge and boolean algebra.
SQLite document store and inverted index: insert, bulk load, lookup, and range scan throughput.
Parse + compile latency for SELECT, JOIN, CTE, window functions, and complex WHERE clauses.
Volcano physical operators: sequential scan, filter, project, hash aggregate, sort, and disk spilling.
DPccp enumeration for star and chain join graphs with 3–16 relations. Greedy fallback at 20–30 relations.
BM25, Bayesian BM25, vector scoring, and multi-signal fusion across varying corpus sizes.
BFS traversal, pattern matching (MRV + arc consistency), regular path queries on social-graph topologies.
PageRank, HITS, betweenness, bounded RPQ, weighted paths, subgraph index, incremental matching, and progressive fusion.
Full query lifecycle: parse, optimize, execute, and materialize for analytical and hybrid queries.