Converts raw BM25 retrieval scores into calibrated relevance probabilities using Bayesian inference. Probabilistic fusion for hybrid text and vector search.
Standard BM25 produces unbounded scores that lack consistent meaning across queries. A score of 8.5 might be highly relevant for one query but mediocre for another, making threshold-based filtering and multi-signal fusion unreliable. Bayesian BM25 solves this by transforming scores into calibrated probabilities in $[0, 1]$.
The likelihood of relevance given a BM25 score $s$ follows a sigmoid model parameterized by steepness $\alpha$ and midpoint $\beta$:

$$L(s) = \sigma\bigl(\alpha (s - \beta)\bigr) = \frac{1}{1 + e^{-\alpha (s - \beta)}}$$
This maps any real-valued BM25 score $s$ into $(0, 1)$, capturing the intuition that relevance probability increases with score but saturates at the extremes.
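A minimal sketch of this likelihood; the default $\alpha$ and $\beta$ values below are illustrative placeholders, not the library's defaults:

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def bm25_likelihood(score: float, alpha: float = 1.0, beta: float = 5.0) -> float:
    """Sigmoid likelihood: sigma(alpha * (score - beta))."""
    return sigmoid(alpha * (score - beta))
```

A score at the midpoint $\beta$ maps to exactly 0.5, and increasing $\alpha$ sharpens the transition around it.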
A document's prior relevance probability combines two information-theoretic signals: term frequency (higher TF indicates richer topical content) and document length ratio (documents near the corpus average length are more likely to be relevant):

$$\pi = \pi_{tf} \cdot \pi_{len}$$
where $\pi_{tf} = \sigma(\log(1 + tf))$ and $\pi_{len} = 0.5 + 0.5 \cdot (1 - |r - 1|)$ with $r$ being the document length ratio. The composite prior is multiplicative in probability space, which becomes additive in log-odds space.
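A sketch of the composite prior under these definitions. Clamping the triangular length term at zero is a safety assumption added here, since $|r - 1| > 1$ would otherwise push $\pi_{len}$ below zero:

```python
import math

def composite_prior(tf: float, length_ratio: float) -> float:
    """pi = pi_tf * pi_len, multiplicative in probability space."""
    # pi_tf = sigma(log(1 + tf)): saturating term-frequency signal
    pi_tf = 1.0 / (1.0 + math.exp(-math.log1p(tf)))
    # pi_len = 0.5 + 0.5 * (1 - |r - 1|): peaks at the corpus-average length
    # (r = 1); the triangle is clamped at zero so pi_len stays in [0.5, 1]
    # (clamping is an assumption, not from the text)
    pi_len = 0.5 + 0.5 * max(0.0, 1.0 - abs(length_ratio - 1.0))
    return pi_tf * pi_len
```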
The final posterior combines likelihood, prior, and a corpus-level base rate $b_r$ using Bayes' theorem. In log-odds space, the three terms decompose additively:

$$\text{logit}(P) = \text{logit}(L) + \text{logit}(\pi) + \text{logit}(b_r)$$
This three-term decomposition separates the score-dependent signal (likelihood) from document-specific context (prior) and corpus-level prevalence (base rate), enabling each component to be estimated independently. The base rate alone reduces Expected Calibration Error by 68–77% without requiring any relevance labels.
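The decomposition can be sketched directly; with a neutral prior and base rate of 0.5 (zero logit), the posterior reduces to the likelihood:

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def posterior(likelihood: float, prior: float, base_rate: float) -> float:
    """logit(P) = logit(L) + logit(pi) + logit(b_r), mapped back via sigmoid."""
    return sigmoid(logit(likelihood) + logit(prior) + logit(base_rate))
```

A low base rate pulls the posterior down: `posterior(0.9, 0.5, 0.05)` lands well below 0.9, reflecting that most documents in a corpus are irrelevant to any given query.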
Naive probability multiplication ($P = \prod P_i$) suffers from conjunction shrinkage: combining several agreeing high-probability signals yields a fused probability lower than any individual signal. The log-odds conjunction resolves this by averaging in logit space with confidence scaling:

$$P = \sigma\!\left(n^{\alpha} \cdot \frac{1}{n} \sum_{i=1}^{n} \text{logit}(P_i)\right)$$
The $n^{\alpha}$ factor (with $\alpha = 0.5$ giving $\sqrt{n}$ scaling) compensates for the averaging effect so that $n$ perfectly agreeing signals return the consensus probability, not a diluted version. With per-signal weights $w_i$, this becomes the Log-OP formulation $\sigma\!\bigl(n^{\alpha} \sum w_i \cdot \text{logit}(P_i)\bigr)$, where $\alpha$ and weights compose multiplicatively.
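A sketch contrasting naive multiplication with the unweighted log-odds conjunction (uniform weights, so the sum of logits becomes a mean):

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def naive_product(probs):
    """Conjunction shrinkage: the product falls below every input."""
    out = 1.0
    for p in probs:
        out *= p
    return out

def log_odds_conjunction(probs, alpha=0.5):
    """sigma(n^alpha * mean(logit(P_i))): agreement raises confidence."""
    n = len(probs)
    mean_logit = sum(logit(p) for p in probs) / n
    return sigmoid(n ** alpha * mean_logit)
```

For three agreeing signals at 0.9, the naive product collapses to about 0.73 while the log-odds conjunction stays above 0.9.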
```mermaid
graph LR
    S["Raw BM25 Score"] --> SIG["Sigmoid Likelihood<br/>L = sigma(alpha * (s - beta))"]
    SIG --> POST["Bayesian Posterior<br/>logit(P) = logit(L) + logit(pi) + logit(br)"]
    TF["Term Frequency"] --> PRIOR["Composite Prior<br/>pi = pi_tf * pi_len"]
    DL["Doc Length Ratio"] --> PRIOR
    PRIOR --> POST
    BR["Base Rate"] --> POST
    POST --> PROB["Calibrated<br/>Probability"]
    PROB --> FUSE["Fusion<br/>Log-Odds / AND / OR"]
    VEC["Vector Score"] --> COS["cosine_to_probability()"]
    VEC --> VPT["VectorProbabilityTransform<br/>P = sigmoid(log(f_R/f_G) + logit(base))"]
    ANN["ANN Index<br/>(IVF / HNSW)"] -.->|"local sample"| VPT
    COS --> FUSE
    VPT --> FUSE
    FUSE --> RANK["Final<br/>Ranking"]
```
- Sigmoid likelihood converts unbounded BM25 scores to $(0, 1)$ probabilities with learnable steepness and midpoint parameters.
- Term frequency and document length signals provide document-specific relevance context independent of the query score.
- Corpus-level relevance prevalence is estimated from the score distribution (percentile, mixture model, or elbow detection); no labels needed.
- Batch gradient descent or online SGD with EMA-smoothed gradients and Polyak averaging; three training modes: C1, C2, C3.
- Log-odds conjunction, probabilistic AND/OR/NOT, per-signal weights, sparse gating (ReLU/Swish/GELU/Softplus), multi-head attention, and neural score calibration.
- Likelihood ratio framework replaces $\frac{1+\cos\theta}{2}$ with density ratio estimation (KDE/GMM); `calibrate_with_sample()` decouples evaluation from the density sample for index-aware ANN calibration.
- Safe Bayesian probability upper bounds for document pruning in top-k retrieval, block-max (BMW) bounds for tighter per-block pruning, and attention pruning for fusion.
All fusion operations work in probability space with well-defined semantics. Because every signal is a calibrated probability, they compose freely.
| Function | Description | Use Case |
|---|---|---|
| `log_odds_conjunction()` | $\sigma\!\bigl(n^{\alpha}\sum w_i \cdot \text{logit}(P_i)\bigr)$ — agreement-aware fusion | Primary multi-signal fusion |
| `balanced_log_odds_fusion()` | Min-max normalize logits, then combine — equalizes signal scales | Hybrid BM25 + dense search |
| `prob_and()` | $P = \prod P_i$ — product rule in log-space | Strict conjunction (all must match) |
| `prob_or()` | $P = 1 - \prod(1 - P_i)$ — complement rule in log-space | Disjunction (any may match) |
| `prob_not()` | $P = 1 - P_i$ — complement | Exclusion queries |
| `cosine_to_probability()` | $(1 + \cos\theta) / 2$ with epsilon clamping | Convert vector similarity for fusion |
| `VectorProbabilityTransform` | $\sigma(\log(f_R / f_G) + \text{logit}(P_{base}))$ — likelihood ratio calibration | Calibrated vector similarity via density estimation (Paper 3) |
`LearnableLogOddsWeights` learns per-signal reliability from labeled data via a Hebbian gradient that is backprop-free. Starting from the Naive Bayes uniform initialization ($w_i = 1/n$), the gradient $\nabla_{z_j} = n^{\alpha}(p - y) \cdot w_j(x_j - \bar{x}_w)$ adjusts weights based on pre-synaptic activity times post-synaptic error.
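A sketch of one update step using the gradient above. The softmax parameterization of weights over logits $z$ and the learning rate are assumptions; the gradient itself is the formula from the text:

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def hebbian_step(probs, y, z, alpha=0.5, lr=0.1):
    """One backprop-free update of per-signal reliability logits z.
    Gradient (from the text): n^alpha * (p - y) * w_j * (x_j - xbar_w)."""
    n = len(probs)
    x = [logit(p) for p in probs]
    # Softmax over z yields weights summing to 1; z = 0 gives the
    # Naive Bayes uniform initialization w_i = 1/n (parameterization assumed)
    m = max(z)
    exps = [math.exp(zj - m) for zj in z]
    total = sum(exps)
    w = [e / total for e in exps]
    xbar = sum(wj * xj for wj, xj in zip(w, x))   # weighted mean logit
    p = sigmoid(n ** alpha * xbar)                # fused Log-OP prediction
    grads = [n ** alpha * (p - y) * wj * (xj - xbar) for wj, xj in zip(w, x)]
    return [zj - lr * g for zj, g in zip(z, grads)], p
```

With a positive label and two disagreeing signals, one step raises the reliability logit of the stronger signal and lowers the weaker one.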
`AttentionLogOddsWeights` replaces static weights with query-dependent attention. A linear projection from query features to softmax attention weights allows the fusion to adapt per query: some queries benefit more from lexical signals, others from semantic similarity. Optional per-signal logit normalization (`normalize=True`) equalizes signal scales before the weighted sum.
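The query-dependent weighting can be sketched as a linear projection followed by softmax; the projection matrix here is a hypothetical learned parameter:

```python
import math

def softmax(xs):
    """Stable softmax: shift by the max before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query_features, projection):
    """Per-query signal weights: softmax over a linear projection of query
    features. Each row of `projection` scores one signal."""
    scores = [sum(w * f for w, f in zip(row, query_features))
              for row in projection]
    return softmax(scores)
```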
For high-dimensional signal spaces, ReLU gating
($\max(0, \text{logit})$, MAP estimation), Swish gating
($\text{logit} \cdot \sigma(\text{logit})$, Bayes estimation),
and GELU gating
($\text{logit} \cdot \sigma(1.702 \cdot \text{logit})$, Gaussian noise model)
suppress noisy negative-logit signals before aggregation
(Paper 2, Theorems 6.5.3/6.7.4/6.8.1).
Softplus gating
($\log(1 + e^{\beta \cdot \text{logit}}) / \beta$, Remark 6.5.4)
is a smooth ReLU that never zeroes out evidence, making it suitable for
small datasets where discarding any signal is costly.
The generalized Swish gate
$\text{logit} \cdot \sigma(\beta \cdot \text{logit})$
interpolates between $x/2$ ($\beta \to 0$), standard Swish ($\beta = 1$),
and ReLU ($\beta \to \infty$) via the `gating_beta` parameter
(Theorem 6.7.6).
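The four gates, sketched directly from the formulas above ($\beta$ defaults to 1):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def relu_gate(t: float) -> float:
    """Hard threshold (MAP estimation): negative-logit evidence is discarded."""
    return max(0.0, t)

def swish_gate(t: float, beta: float = 1.0) -> float:
    """Generalized Swish t * sigma(beta*t): x/2 as beta -> 0, ReLU as beta -> inf."""
    return t * sigmoid(beta * t)

def gelu_gate(t: float) -> float:
    """GELU via the sigmoid approximation t * sigma(1.702 * t)."""
    return t * sigmoid(1.702 * t)

def softplus_gate(t: float, beta: float = 1.0) -> float:
    """Smooth ReLU log(1 + e^(beta*t)) / beta: never zeroes out evidence."""
    return math.log1p(math.exp(beta * t)) / beta
```

Unlike ReLU, Softplus leaves a small positive residue for negative logits, which is why it suits small datasets where no evidence should be discarded outright.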
`MultiHeadAttentionLogOddsWeights` creates multiple independent
attention heads with different random initializations. Each head produces
fused log-odds independently, then the results are averaged in log-odds space
before converting back to probability via sigmoid (Remark 8.6). Multi-head
diversity reduces variance compared to single-head attention.
Both single-head and multi-head attention support exact pruning via `compute_upper_bounds()` and `prune()` (Theorem 8.7.1).
`PlattCalibrator` (sigmoid: $P = \sigma(a \cdot s + b)$) and `IsotonicCalibrator` (PAVA monotone regression) convert raw neural model scores into calibrated probabilities suitable for Bayesian fusion. Calibrated scores can be combined with BM25 probabilities via `log_odds_conjunction()`.
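A minimal sketch of the sigmoid fit; the actual `PlattCalibrator` may use a different optimizer, and the learning rate and epoch count here are illustrative:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def fit_platt(scores, labels, lr=0.1, epochs=500):
    """Fit P = sigma(a*s + b) by gradient descent on the log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid(a * s + b)
            grad_a += (p - y) * s / n   # d(logloss)/da
            grad_b += (p - y) / n       # d(logloss)/db
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b
```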
`VectorProbabilityTransform` replaces the naive
$\frac{1+\cos\theta}{2}$ conversion with a likelihood ratio framework (Paper 3):
$P(R|d) = \sigma(\log(f_R(d) / f_G(d)) + \text{logit}(P_{base}))$,
where $f_R$ is estimated via weighted KDE or GMM-EM, and $f_G$ is the
background Gaussian. Auto-routing selects KDE (gap detected, $K \geq 50$)
or GMM (small $K$ or smooth distributions).
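To make the transform concrete, here is a sketch with single Gaussians standing in for the KDE/GMM density estimates ($f_R$ over scores near relevant neighbors, $f_G$ the background); all parameter values are illustrative:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def likelihood_ratio_prob(score, rel_mu, rel_var, bg_mu, bg_var, base_rate=0.05):
    """P(R|d) = sigma(log(f_R / f_G) + logit(P_base)), with Gaussian
    stand-ins for the fitted densities."""
    log_ratio = (math.log(gaussian_pdf(score, rel_mu, rel_var))
                 - math.log(gaussian_pdf(score, bg_mu, bg_var)))
    return sigmoid(log_ratio + logit(base_rate))
```

Scores near the relevant-density mode get boosted above the base rate; scores deep in the background mass are suppressed toward zero.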
`calibrate_with_sample()` decouples density estimation from
evaluation: the local ANN neighborhood (e.g. IVF probed cells) provides
the training sample for $f_R$, while probabilities are produced for an
arbitrary evaluation set. This is the index-aware calibration path where
the density landscape comes from one set of distances and the output
probabilities are needed at different points.
Standalone helpers `ivf_density_prior()` and `knn_density_prior()` provide optional density priors from IVF cell populations or HNSW neighbor distances.
Evaluated on 5 BEIR datasets using the retrieve-then-evaluate protocol (top-1000 per signal, union candidates, pytrec_eval). Dense encoder: all-MiniLM-L6-v2. BM25: k1=1.2, b=0.75, Lucene variant with Snowball English stemmer.
| Method | ArguAna | FiQA | NFCorpus | SciDocs | SciFact | Average |
|---|---|---|---|---|---|---|
| BM25 | 36.16 | 25.32 | 31.85 | 15.65 | 67.91 | 35.38 |
| Dense | 36.98 | 36.87 | 31.59 | 21.64 | 64.51 | 38.32 |
| Convex | 40.03 | 37.10 | 35.61 | 19.65 | 73.38 | 41.15 |
| RRF | 39.61 | 36.85 | 34.43 | 20.09 | 71.43 | 40.48 |
| Bayesian-Balanced | 37.27 | 40.59 | 35.73 | 21.40 | 72.47 | 41.50 |
| Bayesian-Attn-Norm | 37.21 | 40.43 | 35.43 | 21.91 | 73.22 | 41.64 |
| Bayesian-Vector-Balanced | 37.53 | 40.02 | 35.13 | 21.44 | 70.24 | 40.87 |
| Bayesian-MultiHead-Norm | 37.13 | 39.08 | 35.72 | 21.78 | 70.60 | 40.86 |
| Method | Type | Δ NDCG@10 vs BM25 |
|---|---|---|
| Bayesian-Attn-Norm | zero-shot | +6.28 |
| Bayesian-Balanced | zero-shot | +6.12 |
| Convex | zero-shot | +5.78 |
| Bayesian-Vector-Attn | zero-shot | +5.60 |
| Bayesian-Vector-Balanced | zero-shot | +5.49 |
| Bayesian-MultiHead-Norm | zero-shot | +5.48 |
| RRF | zero-shot | +5.10 |
| Bayesian-MultiHead | zero-shot | +5.08 |
| Bayesian-Attention | zero-shot | +4.99 |
| Dense | zero-shot | +2.94 |
Bayesian-Attn-Norm achieves the highest average NDCG@10 (41.67%), outperforming Convex combination (+0.52), RRF (+1.18), and BM25 (+6.28) in a fully zero-shot setting with no relevance labels. Bayesian-Vector-Balanced (+5.49) uses likelihood ratio calibration (Paper 3) for the dense signal, competitive with Convex (+5.78). See the full benchmark tables with 26 methods, MAP@10, and Recall@10.
| Method | NFCorpus ECE | SciFact ECE |
|---|---|---|
| Bayesian (no base rate) | 0.6519 | 0.7989 |
| Bayesian (base_rate=auto) | 0.1461 (-77.6%) | 0.2577 (-67.7%) |
| Batch fit + base_rate=auto | 0.0085 (-98.7%) | 0.0021 (-99.7%) |
| Platt scaling | 0.0186 (-97.1%) | 0.0188 (-97.7%) |
The base rate prior alone reduces ECE by 68–77% without any labeled data. With batch fitting, calibration error drops below 1%.
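For reference, the ECE metric itself is simple to compute; this equal-width 10-bin version is a common convention, though the benchmark's exact binning scheme is not specified here:

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width binning ECE: sum over bins of (bin size / N) * |acc - conf|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes in the top bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)   # mean predicted probability
        acc = sum(y for _, y in b) / len(b)    # empirical relevance rate
        ece += len(b) / n * abs(acc - conf)
    return ece
```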
```bash
# Zero-shot (26 methods, exact dense retrieval)
python benchmarks/hybrid_beir.py -d <beir-data-dir>

# With IVF dense backend (index-aware calibration)
python benchmarks/hybrid_beir.py -d <beir-data-dir> --dense-backend ivf

# With tuning (supervised + grid search)
python benchmarks/hybrid_beir.py -d <beir-data-dir> --tune

# Download BEIR datasets automatically
python benchmarks/hybrid_beir.py -d <beir-data-dir> --download
```
- Included as a baseline retrieval model (`bb25`) for the Massive Text Embedding Benchmark.
- Used for BM25 score normalization in hybrid search (`normalize="bayesian-bm25"`).
- Adopted as an official sample application.
- Scoring operator for probabilistic text retrieval and multi-signal fusion in the unified query algebra.
```bibtex
@preprint{Jeong2026BayesianBM25,
  author    = {Jeong, Jaepil},
  title     = {Bayesian {BM25}: {A} Probabilistic Framework for Hybrid Text
               and Vector Search},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.18414940},
  url       = {https://doi.org/10.5281/zenodo.18414940}
}

@preprint{Jeong2026BayesianNeural,
  author    = {Jeong, Jaepil},
  title     = {From {Bayesian} Inference to Neural Computation: The Analytical
               Emergence of Neural Network Structure from Probabilistic
               Relevance Estimation},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.18512411},
  url       = {https://doi.org/10.5281/zenodo.18512411}
}

@preprint{Jeong2026VectorCalibration,
  author = {Jeong, Jaepil},
  title  = {Vector Scores as Likelihood Ratios: Index-Derived {Bayesian}
            Calibration for Hybrid Search},
  year   = {2026}
}
```