Overview
UQA is built on four interconnected theoretical papers that establish a mathematical foundation spanning database theory, information retrieval, probabilistic inference, and neural computation. Each paper extends the previous one, forming a chain of derivations from first principles.
```mermaid
graph LR
    P1["Paper I<br/>Unified Framework<br/>(Dec 2023)"] --> P2["Paper II<br/>Graph Extension<br/>(Dec 2024)"]
    P1 --> P3["Paper III<br/>Bayesian BM25<br/>(Jan 2026)"]
    P3 --> P4["Paper IV<br/>Neural Emergence<br/>(Feb 2026)"]
    P2 --> P4
```
I. Unified Framework
Posting lists as universal abstraction. Complete Boolean algebra over relational, text, and vector paradigms. Category-theoretic soundness.
II. Graph Extension
Graph-posting list isomorphism. Traversal, pattern matching, and regular path queries integrated into the Boolean algebra.
III. Bayesian BM25
BM25 scores transformed to calibrated probabilities via Bayesian inference. Sigmoid likelihood, composite prior, base rate calibration.
IV. Neural Emergence
Neural network structure emerges analytically from multi-signal Bayesian inference. Activation functions as probabilistic answers.
I. A Unified Mathematical Framework for Query Algebras
The foundational paper establishes that posting lists — sorted sequences of $(id, payload)$ pairs — can serve as a universal abstraction for representing result sets across relational, text retrieval, and vector search paradigms. This observation enables a single algebraic structure that preserves the expressivity of each paradigm while supporting cross-paradigm composition.
Posting Lists as Universal Abstraction
A posting list is an ordered sequence of $(id, payload)$ pairs whose document identifiers are sorted in ascending order:

$$L = [(id_1, p_1), (id_2, p_2), \ldots, (id_n, p_n)], \quad id_1 < id_2 < \cdots < id_n$$
A bijective mapping $PL: 2^{\mathcal{D}} \rightarrow \mathcal{L}$ converts document sets to posting lists and back. This bijection is the key insight: any operation on document sets has a corresponding operation on posting lists, and vice versa. The sorted-order invariant enables efficient two-pointer merge algorithms for all Boolean operations.
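The sorted-order invariant and its two-pointer merges can be sketched in a few lines of Python; the function names and the payload convention are illustrative, not the paper's:

```python
# Boolean operations on posting lists via two-pointer merges.
# A posting list is an ascending list of (id, payload) pairs.

def intersect(a, b):
    """A ∩ B: advance the pointer behind the smaller id."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            out.append(a[i]); i += 1; j += 1
        elif a[i][0] < b[j][0]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """A ∪ B: merge both lists, deduplicating on id."""
    out, i, j = [], 0, 0
    while i < len(a) or j < len(b):
        if j >= len(b) or (i < len(a) and a[i][0] < b[j][0]):
            out.append(a[i]); i += 1
        elif i >= len(a) or b[j][0] < a[i][0]:
            out.append(b[j]); j += 1
        else:  # equal ids: keep the left payload
            out.append(a[i]); i += 1; j += 1
    return out
```

Both operations run in $O(|A| + |B|)$ time precisely because the sorted-order invariant holds for every operand and is preserved by every result.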
Complete Boolean Algebra
The structure $(\mathcal{L},\ \cup,\ \cap,\ \overline{\cdot},\ \emptyset,\ \mathcal{D})$ forms a complete Boolean algebra, satisfying:
- Commutativity: $A \cup B = B \cup A$ and $A \cap B = B \cap A$
- Associativity: $A \cup (B \cup C) = (A \cup B) \cup C$
- Distributivity: $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$
- Identity: $A \cup \emptyset = A$ and $A \cap PL(\mathcal{D}) = A$
- Complement: $A \cup \overline{A} = PL(\mathcal{D})$ and $A \cap \overline{A} = \emptyset$
Completeness means that for any family of posting lists $\{L_i\}_{i \in I}$, both $\bigvee_{i \in I} L_i$ and $\bigwedge_{i \in I} L_i$ exist. This is critical: it guarantees that arbitrary Boolean combinations of queries across different paradigms are algebraically well-defined, and that optimization can exploit lattice-theoretic rewrite rules (e.g., De Morgan's laws, absorption, distributivity) to restructure query plans.
Primitive Operators
Each data paradigm maps into the posting list space through primitive operators. Because every operator produces a posting list, they compose freely:
| Operator | Definition | Paradigm |
|---|---|---|
| $T(t)$ | $PL(\lbrace d \in \mathcal{D} \mid t \in term(d, f) \rbrace)$ | Text retrieval |
| $V_\theta(q)$ | $PL(\lbrace d \in \mathcal{D} \mid sim(vec(d, f), q) \geq \theta \rbrace)$ | Vector search |
| $KNN_k(q)$ | $PL(D_k)$ where $\vert D_k \vert = k$, ranked by similarity | Vector search |
| $Filter_{f,v}(L)$ | $L \cap PL(\lbrace d \in \mathcal{D} \mid d.f = v \rbrace)$ | Relational |
| $Score_q(L)$ | $(L, [s_1, \ldots, s_{|L|}])$ | Scoring |
A hybrid text + vector search is simply a Boolean intersection: $Hybrid_{t,q,\theta} = T(t) \cap V_\theta(q)$. More complex compositions — "documents matching query terms AND near a vector AND in category X" — are equally natural because all operators share the posting list type. The paper proves compositional completeness (Theorem 3.3.5): any query expressible as a combination of relational, text, and vector operations has a representation in the unified algebra.
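As an illustration of cross-paradigm composition, the hybrid query above can be sketched over a toy corpus. The `docs` dictionary, `cos`, and the operator bodies are stand-ins, not UQA's actual API:

```python
# Sketch: Hybrid_{t,q,θ} with a relational filter, as plain intersection
# of document-id sets. All data and helper functions are toy stand-ins.
import math

docs = {
    1: {"text": {"neural", "search"}, "vec": (1.0, 0.0), "category": "ml"},
    2: {"text": {"neural", "nets"},   "vec": (0.9, 0.1), "category": "ml"},
    3: {"text": {"search", "trees"},  "vec": (0.0, 1.0), "category": "db"},
}

def cos(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def T(t):                 # text operator: docs containing term t
    return sorted(d for d, doc in docs.items() if t in doc["text"])

def V(q, theta):          # vector operator: similarity >= threshold
    return sorted(d for d, doc in docs.items() if cos(doc["vec"], q) >= theta)

def Filter(f, v, ids):    # relational operator: attribute equality
    return [d for d in ids if docs[d][f] == v]

# "documents matching 'neural' AND near (1,0) AND in category 'ml'":
hybrid = Filter("category", "ml", sorted(set(T("neural")) & set(V((1.0, 0.0), 0.8))))
```

Because every operator returns the same type (a sorted id list standing in for a posting list), the composition needs no glue code beyond set intersection.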
Cross-Paradigm Join Operations
The framework supports join operations by extending posting lists to generalized posting lists, whose identifiers are tuples of identifiers drawn from the joined operands, i.e. entries of the form $((id^A, id^B), payload)$.
Inner, left outer, right outer, full outer, cross, and similarity joins are all defined within this framework. The paper proves commutativity, associativity, and distributivity of join operations over Boolean operations — properties essential for join order optimization.
Category-Theoretic Foundation
The paper establishes a category $\mathcal{C}$ where objects are posting lists and morphisms are query operators. A functor $F: \mathcal{C} \rightarrow \mathbf{Set}$ maps posting lists to their document sets and operators to functions on sets. A query monad $(T, \eta, \mu)$ provides the formal foundation for operator composition and optimization. This machinery guarantees that optimization rewrites (e.g., filter pushdown, intersection reordering) are semantics-preserving transformations, not ad-hoc heuristics.
II. Extending the Framework to Graph Data Structures
The second paper incorporates graph data structures into the unified algebra. The key challenge is showing that graph operations — traversal, pattern matching, regular path queries — can be represented within the existing Boolean algebra without breaking its completeness properties.
Graph-Posting List Isomorphism
A graph posting list pairs identifiers with graph structures: $L_G = [(id_1, G_1), (id_2, G_2), \ldots, (id_n, G_n)]$. The isomorphism $\Phi: \mathcal{L}_G \rightarrow \mathcal{L}$ maps graph posting lists to standard posting lists via a bijective mapping $\phi_{G \rightarrow D}: \mathcal{G} \rightarrow 2^{\mathcal{D}}$ between graphs and document sets.
This isomorphism preserves Boolean operations: $\Phi(L_G^1 \cup_G L_G^2) = \Phi(L_G^1) \cup \Phi(L_G^2)$ and $\Phi(L_G^1 \cap_G L_G^2) = \Phi(L_G^1) \cap \Phi(L_G^2)$. The extended structure $(\mathcal{L}_G,\ \cup_G,\ \cap_G,\ \overline{\cdot}_G,\ \emptyset_G,\ \mathcal{G})$ forms a Boolean algebra isomorphic to the original. Graph operations thus integrate seamlessly with relational, text, and vector operations without requiring a separate query engine.
Graph-Specific Operations
Three operators extend the algebra to handle graph-specific queries:
- Graph Traversal: $Traverse_{v,l,k}(L_G)$ — returns all subgraphs reachable from vertex $v$ via edges with label $l$ up to $k$ hops. Cost: $O(\sum_{i=1}^{k} d^i)$ where $d$ is the average degree.
- Pattern Matching: $GMatch_P(L_G)$ — finds all subgraph isomorphisms matching pattern $P$. NP-complete in general; $O(|V_G|^{|V_P|})$ worst case.
- Vertex Aggregation: $VertexAgg_{p,agg}(L_G)$ — applies aggregation function $agg$ to property $p$ across all vertices.
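A minimal sketch of the traversal operator, assuming an adjacency-list graph encoding; the function signature and the example graph are illustrative, not the paper's:

```python
# Traverse_{v,l,k} sketch: BFS from vertex v over edges labeled l,
# up to k hops. Graph encoding: {vertex: [(label, neighbor), ...]}.
from collections import deque

def traverse(graph, v, label, k):
    """Return vertices reachable from v via label-l edges in <= k hops."""
    seen, frontier = {v}, deque([(v, 0)])
    while frontier:
        u, depth = frontier.popleft()
        if depth == k:          # hop budget exhausted on this branch
            continue
        for lbl, w in graph.get(u, []):
            if lbl == label and w not in seen:
                seen.add(w)
                frontier.append((w, depth + 1))
    return sorted(seen - {v})

g = {"a": [("knows", "b"), ("likes", "x")],
     "b": [("knows", "c")],
     "c": [("knows", "d")]}
# traverse(g, "a", "knows", 2) -> ["b", "c"]
```

Each BFS level expands at most $d$ edges per frontier vertex, matching the stated $O(\sum_{i=1}^{k} d^i)$ cost.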
Regular Path Queries
Regular path expressions extend graph traversal with the full power of regular languages. Given an expression $R$ over edge labels (supporting concatenation $R_1 \cdot R_2$, alternation $R_1 | R_2$, and Kleene star $R_1^*$), the operator $RPQ_R(L_G)$ returns all vertex pairs connected by paths matching $R$. The evaluation complexity is $O(|V_G|^2 \cdot |R|)$ using dynamic programming on the NFA structure.
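The dynamic-programming evaluation can be sketched as a search over the product of the graph with an NFA for $R$. The NFA below is hand-built for $R = \text{knows} \cdot \text{knows}^*$ ("one or more knows edges"); a full implementation would first compile $R$ to an NFA:

```python
# RPQ_R sketch: BFS over (graph vertex, NFA state) pairs.
# nfa_delta maps (state, label) -> next state; accept_states marks matches.
from collections import deque

def rpq(edges, nfa_delta, start_state, accept_states, vertices):
    """Return all (u, v) pairs connected by a path whose labels match R."""
    pairs = set()
    for src in vertices:
        seen = {(src, start_state)}
        frontier = deque(seen)
        while frontier:
            u, q = frontier.popleft()
            if q in accept_states:
                pairs.add((src, u))
            for lbl, w in edges.get(u, []):
                nxt = nfa_delta.get((q, lbl))
                if nxt is not None and (w, nxt) not in seen:
                    seen.add((w, nxt))
                    frontier.append((w, nxt))
    return sorted(pairs)

edges = {"a": [("knows", "b")], "b": [("knows", "c")]}
# NFA for knows·knows*: 0 --knows--> 1, 1 --knows--> 1; accept {1}
delta = {(0, "knows"): 1, (1, "knows"): 1}
```

The outer loop over source vertices and the bounded product-state space give the $O(|V_G|^2 \cdot |R|)$ flavor of the stated complexity.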
The paper establishes an expressivity hierarchy: $\text{Relational} \subsetneq \text{Relational + Graph} \subsetneq \text{Relational + Graph + RPQ}$. Path reachability queries cannot be expressed in standard relational algebra with a fixed number of joins, so graph operations genuinely extend the algebra's expressive power.
Cross-Paradigm Integration
The graph extension defines integration operators for every paradigm combination:
- Graph-Relational: $ToGraph_{v_f, e_f}$ constructs graphs from relational data; $FromGraph_{v_f, e_f}$ flattens graphs back to relations.
- Graph-Vector: $VectorMatch_{P,q,\theta}$ finds subgraphs that both match a structural pattern $P$ and have vector similarity above $\theta$.
- Graph-Text: $TextToGraph$ constructs co-occurrence graphs from text; $SemanticGraphSearch_{q,\theta}$ finds subgraphs semantically related to a query.
A category-theoretic functor $F_G: \mathcal{C}_G \rightarrow \mathcal{C}$ preserves the algebraic structure, and a graph monad $(T_G, \eta_G, \mu_G)$ provides the composition semantics. The adjunction $T_G \dashv F_G$ formally connects graph operations to the base query algebra.
III. Bayesian BM25: A Probabilistic Framework for Hybrid Search
The third paper addresses a fundamental problem in hybrid search: BM25 scores are unbounded real numbers that cannot be directly combined with bounded signals like cosine similarity. Bayesian BM25 transforms BM25 scores into calibrated probabilities in $[0, 1]$, enabling principled multi-signal fusion.
The BM25 Score Interpretation Problem
BM25 scores are defined by the standard Okapi formulation:

$$\text{BM25}(d, q) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}$$

where $f(t, d)$ is the frequency of term $t$ in document $d$, $|d|$ is the document length, $\text{avgdl}$ is the average document length, and $k_1, b$ are free parameters.
These scores have range $[0, +\infty)$, vary with query length and term specificity, depend on corpus statistics, and cannot be meaningfully combined with bounded signals. The paper also demonstrates that Reciprocal Rank Fusion (RRF) — the most common alternative — suffers from information loss (discarding score magnitudes), rank sensitivity, non-commutativity with filtering, and an arbitrary constant $k=60$ with no theoretical justification.
Sigmoid Calibration via Bayesian Inference
The paper derives the sigmoid transformation from Bayes' theorem. Given a binary relevance variable $R \in \{0, 1\}$ and observed score $s$, the likelihood model $P(s \mid R = 1) = \sigma(\alpha(s - \beta))$ with the symmetric assumption $P(s \mid R = 0) = 1 - P(s \mid R = 1)$ yields the posterior:

$$P(R = 1 \mid s) = \frac{L \cdot p}{L \cdot p + (1 - L)(1 - p)}$$
where $L = \sigma(\alpha(s - \beta))$ is the likelihood and $p$ is the prior. The parameters $\alpha$ (steepness) and $\beta$ (midpoint) are estimated from the score distribution, requiring no relevance labels. A composite prior combines term frequency and document length signals to encode a document quality signal distinct from the BM25 relevance score — an empirical Bayes approach.
A key property is monotonicity preservation: for fixed prior $p$ and $\alpha > 0$, $s_1 > s_2 \implies P(R = 1 \mid s_1) > P(R = 1 \mid s_2)$. The Bayesian transformation preserves BM25's ranking while producing calibrated probability outputs.
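A sketch of the two-term posterior and its monotonicity property; the values of $\alpha$, $\beta$, and the prior below are made up for illustration:

```python
# Two-term Bayesian BM25 posterior: likelihood L = σ(α(s − β)) combined
# with a document prior p via Bayes' theorem. Parameter values are toy.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def posterior(score, alpha, beta, prior):
    """P(R=1 | s) = L·p / (L·p + (1−L)(1−p)) with L = σ(α(s−β))."""
    L = sigmoid(alpha * (score - beta))
    return (L * prior) / (L * prior + (1.0 - L) * (1.0 - prior))

# Monotonicity: for a fixed prior, a higher raw score gives a higher
# calibrated probability, so BM25's ranking is preserved.
p_low  = posterior(5.0,  alpha=0.5, beta=10.0, prior=0.3)
p_high = posterior(15.0, alpha=0.5, beta=10.0, prior=0.3)
```

The output always lies in $(0, 1)$, which is what makes fusion with bounded signals like cosine similarity well-defined.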
Three-Term Posterior with Base Rate Prior
A corpus-level base rate $b_r$ captures the global probability that a random document is relevant to a random query. The full posterior decomposes into three additive terms in log-odds space:

$$\text{logit}\big(P(R = 1 \mid s)\big) = \underbrace{\alpha(s - \beta)}_{\text{likelihood}} + \underbrace{\text{logit}(p)}_{\text{prior}} + \underbrace{\text{logit}(b_r)}_{\text{base rate}}$$
The additivity in log-odds space is a consequence of sequential Bayes updates — the same property that the fourth paper identifies as the mathematical structure underlying neural network hidden layers. The base rate reduces expected calibration error by 68–77% without requiring relevance labels, and when $b_r = 0.5$ the term vanishes ($\text{logit}(0.5) = 0$), recovering the two-term posterior.
Hybrid Search Fusion
With all signals calibrated to probabilities, the paper defines Boolean combination operations:
- Probabilistic AND: $P(\text{AND}) = \prod_{i=1}^{n} p_i$ (product rule under independence)
- Probabilistic OR: $P(\text{OR}) = 1 - \prod_{i=1}^{n} (1 - p_i)$ (inclusion-exclusion)
- Probabilistic NOT: $P(\text{NOT}) = 1 - p$ (complement)
All operations are performed in log-space for numerical stability. The fourth paper identifies a fundamental problem with the product rule (AND) — conjunction shrinkage — and derives the log-odds conjunction as the solution.
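A sketch of the three combinators, accumulated in log-space; the use of `log1p`/`expm1` for stability is our implementation choice, not necessarily the paper's:

```python
# Probabilistic Boolean combinators over calibrated probabilities,
# computed in log-space for numerical stability.
import math

def prob_and(ps):
    """P(AND) = Π p_i, accumulated as a sum of logs."""
    return math.exp(sum(math.log(p) for p in ps))

def prob_or(ps):
    """P(OR) = 1 − Π (1 − p_i), accumulated as Σ log(1 − p_i)."""
    return -math.expm1(sum(math.log1p(-p) for p in ps))

def prob_not(p):
    """P(NOT) = 1 − p."""
    return 1.0 - p
```

With many near-one inputs, `log1p(-p)` keeps the OR computation accurate where the naive product $\prod (1 - p_i)$ would underflow.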
WAND and Block-Max WAND Compatibility
The paper proves that the Bayesian transformation preserves safe document pruning in WAND (Weak AND) and BMW (Block-Max WAND) algorithms. The key insight is that the sigmoid is a monotonically increasing function, so an upper bound on the raw score implies an upper bound on the posterior probability. Modified upper bounds account for the prior's variability across documents, ensuring that no relevant document is incorrectly pruned.
IV. From Bayesian Inference to Neural Computation
The fourth paper demonstrates that feedforward neural network structure emerges analytically from multi-signal Bayesian inference over binary relevance judgments. Starting from the question "what is the probability that a document is relevant given multiple evidence signals?", the paper applies Bayes' theorem and arrives at a neural network — not by design, but as a mathematical consequence.
The Conjunction Shrinkage Problem
The product rule from Paper III has a fundamental deficiency when applied to evidence accumulation. For $n$ independent signals each reporting probability $p < 1$:

$$P(\text{AND}) = \prod_{i=1}^{n} p = p^{n} \longrightarrow 0 \quad \text{as } n \to \infty$$
The probability monotonically decreases with the number of confirming signals. This violates the intuition of evidence accumulation: agreement among independent sources should increase confidence, not decrease it. The product rule answers "what is the probability that all conditions are simultaneously satisfied?" but the question a search system poses is "how confident should we be given that multiple signals concur?" These are semantically distinct questions.
Log-Odds Conjunction
The resolution works in the log-odds domain — the natural parameter space of the Bernoulli exponential family. The log-odds mean aggregates evidence without conjunction shrinkage:

$$\bar{\ell} = \frac{1}{n} \sum_{i=1}^{n} \text{logit}(P_i), \qquad P_{\text{final}} = \sigma(\bar{\ell})$$

The log-odds mean is proven to be the exact normalized form of the Logarithmic Opinion Pool (Log-OP), also known as the Product of Experts (Hinton, 2002). It is scale-neutral: if $P_i = p$ for all $i$, then $\sigma(\bar{\ell}) = p$ regardless of $n$. A confidence scaling factor $n^{\alpha}$ (with $\alpha = 0.5$ encoding the classical $\sqrt{n}$ law) amplifies the magnitude while preserving sign:

$$P_{\text{final}} = \sigma\!\left(n^{\alpha} \, \bar{\ell}\right)$$
| $P_{\text{text}}$ | $P_{\text{vec}}$ | Product Rule | Log-Odds Conjunction | Interpretation |
|---|---|---|---|---|
| 0.9 | 0.9 | 0.81 | 0.96 | Agreement amplified |
| 0.7 | 0.7 | 0.49 | 0.77 | Moderate agreement preserved |
| 0.7 | 0.3 | 0.21 | 0.50 | Exact neutrality (logits cancel) |
| 0.3 | 0.3 | 0.09 | 0.23 | Irrelevance preserved |
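The table above can be reproduced numerically, assuming the $\sqrt{n}$ confidence scaling ($\alpha = 0.5$) applied to the log-odds mean:

```python
# Product rule vs. log-odds conjunction with √n confidence scaling:
# P_final = σ(n^0.5 · mean of logits).
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def product_rule(ps):
    out = 1.0
    for p in ps:
        out *= p
    return out

def log_odds_conjunction(ps, alpha=0.5):
    n = len(ps)
    mean = sum(logit(p) for p in ps) / n
    return sigmoid(n ** alpha * mean)

for a, b in [(0.9, 0.9), (0.7, 0.7), (0.7, 0.3), (0.3, 0.3)]:
    print(a, b, round(product_rule([a, b]), 2), round(log_odds_conjunction([a, b]), 2))
```

Note the 0.7/0.3 row: the logits cancel exactly, so the conjunction returns 0.5 (neutrality), where the product rule would report 0.21.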
Emergence of Neural Network Structure
The full computation for multi-signal relevance estimation has three stages:
- Calibration: Each signal produces a probability $P_i \in (0, 1)$ via sigmoid calibration (BM25), linear calibration (cosine similarity), or external calibration (neural rankers).
- Log-Odds Aggregation: Probabilities are mapped to log-odds and aggregated as a weighted sum (normalized Log-OP).
- Final Posterior: The result passes through a sigmoid to return to probability space.
```mermaid
graph LR
    S1["Signal 1<br/>BM25 score"] -->|"sigmoid"| P1["P1"]
    S2["Signal 2<br/>cosine sim"] -->|"linear"| P2["P2"]
    S3["Signal 3<br/>graph score"] -->|"sigmoid"| P3["P3"]
    P1 -->|"logit"| L1["l1"]
    P2 -->|"logit"| L2["l2"]
    P3 -->|"logit"| L3["l3"]
    L1 --> SUM["Weighted Sum<br/>1/n^(1-a)"]
    L2 --> SUM
    L3 --> SUM
    SUM -->|"sigmoid"| OUT["Pfinal"]
```
The paper identifies this as a feedforward neural network: inputs are calibrated to probabilities (first layer), pass through the logit nonlinearity (hidden layer), undergo linear aggregation, and pass through a sigmoid activation (output layer). The paper states this identification as a theorem.
When all signals share sigmoid calibration (e.g., multiple BM25 fields), the logit-sigmoid identity $\text{logit}(\sigma(x)) = x$ collapses the hidden layer, reducing the computation to logistic regression — the classical equivalence between Bayesian posterior computation and logistic regression, arrived at from a new direction. When signals have heterogeneous calibrations (the practically important case of hybrid search combining BM25 and cosine similarity), the logit performs a genuine nonlinear transformation, yielding a true two-layer network:
- Linear pathway for sigmoid-calibrated signals (logit-sigmoid cancellation)
- Nonlinear pathway for differently-calibrated signals (logit as genuine nonlinearity)
Activation Functions as Probabilistic Answers
The paper independently derives each major activation function from a complementary probabilistic question:
| Activation | Probabilistic Question | Derivation |
|---|---|---|
| Sigmoid $\sigma(x) = \frac{1}{1 + e^{-x}}$ | "How probable?" — posterior for binary relevance | Canonical link of Bernoulli exponential family. Uniquely satisfies: maps $\mathbb{R} \to (0,1)$, inverse of logit, self-referential derivative, symmetric evidence, maximum entropy. |
| ReLU $\max(0, x)$ | "How much?" — MAP estimate under sparse non-negative prior | Given signal $z = x + \epsilon$ with exponential prior $p(x) = \lambda e^{-\lambda x}$ for $x \geq 0$ and Gaussian noise, the MAP estimate is $\hat{x}_{MAP} = \max(0, z - \lambda\tau^2)$. With unit-absorbed parameters: $\text{ReLU}(z) = \max(0, z)$. |
| Swish $x \cdot \sigma(\beta x)$ | "What is the expected relevant amount?" — Bayesian posterior mean | Under the same sparse gating structure as ReLU, the posterior mean (instead of posterior mode) yields $\mathbb{E}[x \mid z] = z \cdot \sigma(\beta z)$. The MAP-to-Bayes duality of classical statistics manifests as the ReLU-to-Swish transition. |
| GELU $x \cdot \Phi(x)$ | Gaussian approximation of Swish | Replacing the logistic sigmoid with the probit function $\Phi(x)$ gives GELU. The well-known approximation $\Phi(x) \approx \sigma(1.702x)$ establishes GELU $\approx$ Swish$_{1.702}$. |
| Softmax | Multi-class posterior — categorical exponential family | Extension of the sigmoid to $K$-class classification via the categorical canonical link. |
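Two of the table's claims can be checked numerically: the GELU $\approx$ Swish$_{1.702}$ approximation, and the ReLU-as-MAP closed form $\max(0, z - \lambda\tau^2)$ recovered by brute-force search. The grid ranges and test values are arbitrary illustration choices:

```python
import math

def gelu(x):
    """GELU(x) = x · Φ(x), with Φ the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish(x, beta=1.702):
    """Swish_β(x) = x · σ(βx)."""
    return x / (1.0 + math.exp(-beta * x))

# Claim 1: GELU ≈ Swish_{1.702} across [-5, 5].
max_gap = max(abs(gelu(x) - swish(x)) for x in [i / 100.0 for i in range(-500, 501)])

# Claim 2: MAP estimate under exponential prior + Gaussian noise.
# Minimize (z − x)^2 / (2τ^2) + λx over x >= 0 by grid search.
def map_by_grid(z, lam, tau):
    xs = [i / 1000.0 for i in range(0, 10001)]  # x in [0, 10]
    return min(xs, key=lambda x: (z - x) ** 2 / (2 * tau ** 2) + lam * x)
```

With $\lambda = \tau = 1$ and $z = 2$, the grid search lands on $\max(0, 2 - 1) = 1$, and for negative $z$ the minimizer is pinned at zero, exactly the ReLU shape.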
The sigmoid's dual appearance — as individual signal calibrator and as final output activation — is a consequence of the exponential family structure of Bernoulli random variables. Any system processing binary evidence under the natural constraints of probability will arrive at the same function.
Attention as Logarithmic Opinion Pooling
Relaxing the assumption of uniform signal reliability — allowing weights to depend on the query-signal interaction — extends the derived feedforward structure to the attention mechanism. Attention is identified as a form of context-dependent Logarithmic Opinion Pooling, mathematically equivalent to Hinton's Product of Experts (PoE, 2002).
This provides a probabilistic answer to why attention computes a weighted sum: Product-of-Experts combination is multiplicative in probability space, which is additive in log-odds space. The weighted sum is not an arbitrary aggregation — it is the mathematically necessary operation for Log-OP combination.
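A toy sketch of context-dependent pooling: softmax weights over expert log-odds, which reduces to the plain log-odds mean when the weights are uniform. The function names and scoring inputs are illustrative, not the paper's formulation:

```python
# Context-dependent logarithmic opinion pooling: a weighted sum of
# expert logits, with weights produced by a softmax over relevance
# scores (the attention-like ingredient). Toy illustration only.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pooled_posterior(expert_probs, relevance_scores):
    """Weighted sum in log-odds space = multiplicative PoE in prob space."""
    w = softmax(relevance_scores)
    return sigmoid(sum(wi * logit(p) for wi, p in zip(w, expert_probs)))
```

Equal relevance scores give uniform weights, recovering the log-odds mean; skewed scores let one expert dominate the pooled posterior, which is the context dependence the attention analogy refers to.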
The paper further proves that WAND-style token-level pruning and BMW-style head-level pruning are exact for attention with sigmoid-bounded value vectors — the first theoretically grounded framework for exact sparse attention.
Depth as Recursive Bayesian Inference
The derived inference unit is the atomic building block of arbitrarily deep networks. Depth arises from iterated marginalization over latent variables: each layer constructs the evidence required by the next. A deep network is a chain of recursive Bayesian inference, each layer conditioning on the posterior produced by the layer below.
Each layer constructs intermediate evidence that does not exist in the raw data but is needed for higher-level judgments. The correspondence between activation functions and probabilistic questions reframes architecture design as question sequencing — choosing the order of probabilistic questions posed to the data.
Connections Across Papers
The four papers form a coherent theoretical arc. Here are the key connections between them.
From Algebra to Probability
Paper I establishes posting lists as the universal abstraction with Boolean algebra. Paper III asks: what happens when we attach calibrated probabilities to posting list entries? The answer is Bayesian BM25 — and the fusion operations from Paper I (union, intersection) acquire probabilistic semantics (OR = $1 - \prod(1-p_i)$, AND = $\prod p_i$).
From Probability to Neural Structure
Paper III's product-rule AND has a defect (conjunction shrinkage). Paper IV resolves it through log-odds conjunction and discovers that the end-to-end computation is a neural network. The fix for a practical information retrieval problem turns out to be the mathematical structure that a separate scientific tradition calls a "neuron."
Graph Operations in the Probabilistic Framework
Paper II integrates graph traversal and pattern matching into Paper I's Boolean algebra via the graph-posting list isomorphism. In UQA's implementation, graph traversal operators assign reachable vertices a fixed probability score of 0.9, allowing graph signals to participate in the log-odds fusion framework of Paper IV alongside BM25 and vector signals.
Coverage-Based Fusion Defaults
A practical consequence of the theory: when a fusion signal has zero coverage (e.g., BM25 cannot find a query term in its vocabulary), the default probability should be 0.5 — neutral in log-odds space ($\text{logit}(0.5) = 0$). A fixed default of 0.01 encodes strong negative evidence ($\text{logit}(0.01) = -4.6$), causing extreme score drops. UQA's implementation interpolates the default between these two regimes as a function of the signal's coverage.
At zero coverage, $p_{\text{default}} = 0.5$ (absence of evidence, not evidence of absence). At full coverage, $p_{\text{default}} = 0.01$ (a missing document is genuinely penalized).
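The coverage-based default can be sketched as an interpolation between the two endpoints; the linear form below is our assumption — only the endpoints are stated above:

```python
# Coverage-based default probability: neutral (0.5) at zero coverage,
# penalizing (0.01) at full coverage. Linear interpolation is assumed.
def default_probability(coverage, neutral=0.5, penalty=0.01):
    """coverage in [0, 1]: how fully the signal could evaluate the query."""
    return neutral * (1.0 - coverage) + penalty * coverage
```

At intermediate coverage the default slides smoothly from "absence of evidence" toward "evidence of absence", avoiding the cliff that a fixed 0.01 default creates.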
Robertson's Completed Circle
In 1977, Robertson introduced the Probability Ranking Principle, opening a door between probability theory and information retrieval. BM25 was derived from this probabilistic foundation, yet its scores are not probabilities. For nearly fifty years, this circle remained unclosed.
Paper III (Bayesian BM25) closes the circle by returning BM25 scores to probability space. Paper IV reveals what lies on the other side: the computational structure that a separate scientific tradition, developed over different decades and for different purposes, calls a neural network.
The mathematics does not care what we call things. Whether we say "Bayesian posterior" or "sigmoid neuron," "sparse feature detector" or "ReLU unit," "evidence accumulation" or "attention" — the same structures appear wherever information is processed under uncertainty.