Bayesian BM25

Converts raw BM25 retrieval scores into calibrated relevance probabilities using Bayesian inference, enabling probabilistic fusion for hybrid text and vector search.


Core Theory

The Problem with Raw BM25 Scores

Standard BM25 produces unbounded scores that lack consistent meaning across queries. A score of 8.5 might be highly relevant for one query but mediocre for another, making threshold-based filtering and multi-signal fusion unreliable. Bayesian BM25 solves this by transforming scores into calibrated probabilities in $[0, 1]$.

Sigmoid Likelihood

The likelihood of relevance given a BM25 score follows a sigmoid model parameterized by steepness $\alpha$ and midpoint $\beta$:

$$L(R \mid s) = \sigma(\alpha \cdot (s - \beta)) = \frac{1}{1 + e^{-\alpha(s - \beta)}}$$

This maps any real-valued BM25 score $s$ into $(0, 1)$, capturing the intuition that relevance probability increases with score but saturates at the extremes.
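A minimal sketch of the transform in plain Python. The `alpha` and `beta` defaults here are illustrative placeholders, not the library's learned values:

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def likelihood(score: float, alpha: float = 1.0, beta: float = 5.0) -> float:
    """L(R | s) = sigma(alpha * (s - beta))."""
    return sigmoid(alpha * (score - beta))
```

`alpha` controls how sharply the probability rises around the midpoint `beta`; a score exactly at the midpoint maps to 0.5.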

Composite Prior

A document's prior relevance probability combines two information-theoretic signals: term frequency (higher TF indicates richer topical content) and document length ratio (documents near the corpus average length are more likely to be relevant):

$$\pi(d) = \pi_{tf}(d) \cdot \pi_{len}(d)$$

where $\pi_{tf} = \sigma(\log(1 + tf))$ and $\pi_{len} = 0.5 + 0.5 \cdot (1 - |r - 1|)$ with $r$ being the document length ratio. The composite prior is multiplicative in probability space, which becomes additive in log-odds space.
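The two components can be sketched directly from the formulas above. Note the clamp keeping $\pi_{len}$ in $[0.5, 1]$ for extreme length ratios is an assumption of this sketch, not confirmed library behavior:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def composite_prior(tf: int, doc_len: float, avg_len: float) -> float:
    """pi(d) = pi_tf(d) * pi_len(d)."""
    pi_tf = sigmoid(math.log1p(tf))                    # sigma(log(1 + tf))
    r = doc_len / avg_len                              # document length ratio
    pi_len = 0.5 + 0.5 * max(0.0, 1.0 - abs(r - 1.0))  # peaks at r = 1 (clamp assumed)
    return pi_tf * pi_len
```

A document of average length with `tf = 0` gets prior 0.5 (maximally uninformative), and the prior rises with term frequency while penalizing unusually short or long documents.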

Bayesian Posterior

The final posterior combines likelihood, prior, and a corpus-level base rate $b_r$ using Bayes' theorem. In log-odds space, the three terms decompose additively:

$$\text{logit}(P) = \text{logit}(L) + \text{logit}(\pi) + \text{logit}(b_r)$$

This three-term decomposition separates the score-dependent signal (likelihood) from document-specific context (prior) and corpus-level prevalence (base rate), enabling each component to be estimated independently. The base rate alone reduces Expected Calibration Error by 68–77% without requiring any relevance labels.
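The additive decomposition can be sketched in a few lines; the helper names are illustrative, not the package API:

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def posterior(likelihood: float, prior: float, base_rate: float) -> float:
    """logit(P) = logit(L) + logit(pi) + logit(b_r), then map back via sigmoid."""
    return sigmoid(logit(likelihood) + logit(prior) + logit(base_rate))
```

With all three terms at 0.5 (no evidence either way) the posterior is exactly 0.5; a low base rate pulls all posteriors toward 0, which is precisely the calibration effect described above.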

Log-Odds Conjunction

Naive probability multiplication ($P = \prod P_i$) suffers from conjunction shrinkage: combining several agreeing high-probability signals yields a fused probability lower than any individual input (e.g. $0.9^3 \approx 0.73$). The log-odds conjunction resolves this by averaging in logit space with confidence scaling:

$$P_{fused} = \sigma\!\left(n^{\alpha} \cdot \frac{1}{n}\sum_{i=1}^{n} \text{logit}(P_i)\right)$$

The $n^{\alpha}$ factor (with $\alpha = 0.5$ giving $\sqrt{n}$ scaling) compensates for the averaging effect so that $n$ perfectly agreeing signals return the consensus probability, not a diluted version. With per-signal weights $w_i$, this becomes the Log-OP formulation $\sigma\!\bigl(n^{\alpha} \sum w_i \cdot \text{logit}(P_i)\bigr)$, where $\alpha$ and weights compose multiplicatively.
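The shrinkage contrast can be checked with a short sketch (uniform weights, $\alpha = 0.5$; illustrative, not the library implementation):

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def log_odds_conjunction(probs, alpha=0.5):
    """sigma(n^alpha * mean(logit(P_i))): average in logit space,
    then scale by n^alpha so agreeing signals are not diluted."""
    n = len(probs)
    mean_logit = sum(logit(p) for p in probs) / n
    return sigmoid(n ** alpha * mean_logit)
```

Three agreeing signals at 0.9 fuse to a probability above 0.9, whereas the naive product collapses to roughly 0.73; a single signal passes through unchanged.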

Pipeline

```mermaid
graph LR
    S["Raw BM25 Score"] --> SIG["Sigmoid Likelihood<br/>L = sigma(alpha * (s - beta))"]
    SIG --> POST["Bayesian Posterior<br/>logit(P) = logit(L) + logit(pi) + logit(br)"]
    TF["Term Frequency"] --> PRIOR["Composite Prior<br/>pi = pi_tf * pi_len"]
    DL["Doc Length Ratio"] --> PRIOR
    PRIOR --> POST
    BR["Base Rate"] --> POST
    POST --> PROB["Calibrated<br/>Probability"]
    PROB --> FUSE["Fusion<br/>Log-Odds / AND / OR"]
    VEC["Vector Score"] --> COS["cosine_to_probability()"]
    VEC --> VPT["VectorProbabilityTransform<br/>P = sigmoid(log(f_R/f_G) + logit(base))"]
    ANN["ANN Index<br/>(IVF / HNSW)"] -.->|"local sample"| VPT
    COS --> FUSE
    VPT --> FUSE
    FUSE --> RANK["Final<br/>Ranking"]
```

Score Transform

Sigmoid likelihood converts unbounded BM25 scores to (0, 1) probabilities with learnable steepness and midpoint parameters.

Composite Prior

Term frequency and document length signals provide document-specific relevance context independent of the query score.

Base Rate

Corpus-level relevance prevalence estimated from the score distribution (percentile, mixture model, or elbow detection). No labels needed.

Parameter Learning

Batch gradient descent or online SGD with EMA-smoothed gradients and Polyak averaging. Three training modes: C1, C2, C3.

Probabilistic Fusion

Log-odds conjunction, probabilistic AND/OR/NOT, per-signal weights, sparse gating (ReLU/Swish/GELU/Softplus), multi-head attention, and neural score calibration.

Vector Calibration

Likelihood ratio framework replaces $\frac{1+\cos\theta}{2}$ with density ratio estimation (KDE/GMM). calibrate_with_sample() decouples the evaluation points from the density-estimation sample for index-aware ANN calibration.

WAND Pruning

Safe Bayesian probability upper bounds for document pruning in top-k retrieval. Block-max (BMW) bounds for tighter per-block pruning. Attention pruning for fusion.

Multi-Signal Fusion

All fusion operations work in probability space with well-defined semantics. Because every signal is a calibrated probability, they compose freely.

| Function | Description | Use Case |
|---|---|---|
| log_odds_conjunction() | $\sigma\!\bigl(n^{\alpha}\sum w_i \cdot \text{logit}(P_i)\bigr)$ — agreement-aware fusion | Primary multi-signal fusion |
| balanced_log_odds_fusion() | Min-max normalize logits, then combine — equalizes signal scales | Hybrid BM25 + dense search |
| prob_and() | $P = \prod P_i$ — product rule in log-space | Strict conjunction (all must match) |
| prob_or() | $P = 1 - \prod(1 - P_i)$ — complement rule in log-space | Disjunction (any may match) |
| prob_not() | $P = 1 - P_i$ — complement | Exclusion queries |
| cosine_to_probability() | $(1 + \cos\theta) / 2$ with epsilon clamping | Convert vector similarity for fusion |
| VectorProbabilityTransform | $\sigma(\log(f_R / f_G) + \text{logit}(P_{base}))$ — likelihood ratio calibration | Calibrated vector similarity via density estimation (Paper 3) |
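Hedged sketches of the three Boolean-style operators, accumulating in log space for numerical stability as the descriptions above indicate (illustrative implementations, not the library's):

```python
import math

def prob_and(probs):
    """Strict conjunction: P = prod P_i, accumulated in log space."""
    return math.exp(sum(math.log(p) for p in probs))

def prob_or(probs):
    """Disjunction: P = 1 - prod(1 - P_i), via log1p for stability."""
    return 1.0 - math.exp(sum(math.log1p(-p) for p in probs))

def prob_not(p):
    """Exclusion: complement."""
    return 1.0 - p
```

Because all inputs are calibrated probabilities, De Morgan's law holds exactly: `prob_not(prob_or([a, b]))` equals `prob_and([1 - a, 1 - b])`.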

Learnable Weights

LearnableLogOddsWeights learns per-signal reliability from labeled data via a Hebbian gradient that is backprop-free. Starting from the Naive Bayes uniform initialization ($w_i = 1/n$), the gradient $\nabla_{z_j} = n^{\alpha}(p - y) \cdot w_j(x_j - \bar{x}_w)$ adjusts weights based on pre-synaptic activity times post-synaptic error.
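A sketch of this update, assuming the weights are a softmax over latent parameters $z$ (which the $\nabla_{z_j}$ notation suggests); the function name, learning rate, and epoch count are illustrative, not the LearnableLogOddsWeights API:

```python
import math

def sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def hebbian_fit(probs_batch, labels, alpha=0.5, lr=0.5, epochs=200):
    """Learn per-signal weights w = softmax(z) from labeled examples using
    grad_z_j = n^alpha * (p - y) * w_j * (x_j - xbar_w),
    with x_j = logit(P_j) and xbar_w the weighted mean logit."""
    n = len(probs_batch[0])
    z = [0.0] * n                               # uniform init: w_i = 1/n
    for _ in range(epochs):
        for probs, y in zip(probs_batch, labels):
            ez = [math.exp(zi) for zi in z]
            total = sum(ez)
            w = [e / total for e in ez]
            x = [logit(p) for p in probs]
            xbar = sum(wi * xi for wi, xi in zip(w, x))
            p = sigmoid(n ** alpha * xbar)      # fused prediction
            # pre-synaptic activity (x_j - xbar) times post-synaptic error (p - y)
            z = [zi - lr * n ** alpha * (p - y) * wi * (xi - xbar)
                 for zi, wi, xi in zip(z, w, x)]
    ez = [math.exp(zi) for zi in z]
    total = sum(ez)
    return [e / total for e in ez]
```

On a toy batch where the second signal is always 0.5 (zero log-odds, pure noise), the weight mass shifts toward the informative first signal.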

Attention-Based Fusion

AttentionLogOddsWeights replaces static weights with query-dependent attention. A linear projection from query features to softmax attention weights allows the fusion to adapt per query: some queries benefit more from lexical signals, others from semantic similarity. Optional per-signal logit normalization (normalize=True) equalizes signal scales before the weighted sum.

Sparse Signal Gating

For high-dimensional signal spaces, ReLU gating ($\max(0, \text{logit})$, MAP estimation), Swish gating ($\text{logit} \cdot \sigma(\text{logit})$, Bayes estimation), and GELU gating ($\text{logit} \cdot \sigma(1.702 \cdot \text{logit})$, Gaussian noise model) suppress noisy negative-logit signals before aggregation (Paper 2, Theorems 6.5.3/6.7.4/6.8.1). Softplus gating ($\log(1 + e^{\beta \cdot \text{logit}}) / \beta$, Remark 6.5.4) is a smooth ReLU that never zeroes out evidence, making it suitable for small datasets where discarding any signal is costly. The generalized Swish gate $\text{logit} \cdot \sigma(\beta \cdot \text{logit})$ interpolates between $x/2$ ($\beta \to 0$), standard Swish ($\beta = 1$), and ReLU ($\beta \to \infty$) via the gating_beta parameter (Theorem 6.7.6).
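The four gates, sketched directly from the formulas above (stable stand-alone versions; not the package's implementations):

```python
import math

def _sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def relu_gate(x: float) -> float:
    """MAP estimation: hard-threshold negative log-odds evidence."""
    return max(0.0, x)

def swish_gate(x: float, beta: float = 1.0) -> float:
    """Bayes estimation; interpolates x/2 (beta -> 0) .. ReLU (beta -> inf)."""
    return x * _sigmoid(beta * x)

def gelu_gate(x: float) -> float:
    """Gaussian noise model (sigmoid approximation with 1.702 scaling)."""
    return x * _sigmoid(1.702 * x)

def softplus_gate(x: float, beta: float = 1.0) -> float:
    """Smooth ReLU that never zeroes out evidence."""
    bx = beta * x  # log(1 + e^(beta*x)) / beta, computed stably
    return (max(bx, 0.0) + math.log1p(math.exp(-abs(bx)))) / beta
```

The tests below exercise the interpolation claim for the generalized Swish gate and the never-zero property of Softplus.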

Multi-Head Attention

MultiHeadAttentionLogOddsWeights creates multiple independent attention heads with different random initializations. Each head produces fused log-odds independently, then the results are averaged in log-odds space before converting back to probability via sigmoid (Remark 8.6). Multi-head diversity reduces variance compared to single-head attention. Both single-head and multi-head attention support exact pruning via compute_upper_bounds() and prune() (Theorem 8.7.1).

Neural Score Calibration

PlattCalibrator (sigmoid: $P = \sigma(a \cdot s + b)$) and IsotonicCalibrator (PAVA monotone regression) convert raw neural model scores into calibrated probabilities suitable for Bayesian fusion. Calibrated scores can be combined with BM25 probabilities via log_odds_conjunction.
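A minimal Platt-scaling fit by full-batch gradient descent on the log loss. This is a from-scratch sketch under assumed hyperparameters, not the PlattCalibrator API:

```python
import math

def sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def fit_platt(scores, labels, lr=0.01, epochs=2000):
    """Fit P = sigma(a * s + b) by gradient descent on log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # d(log loss)/d(logit)
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b
```

The fitted slope stays positive on separable data, so higher raw scores map to higher calibrated probabilities, which is the monotonicity fusion requires.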

Vector Similarity Calibration

VectorProbabilityTransform replaces the naive $\frac{1+\cos\theta}{2}$ conversion with a likelihood ratio framework (Paper 3): $P(R|d) = \sigma(\log(f_R(d) / f_G(d)) + \text{logit}(P_{base}))$, where $f_R$ is estimated via weighted KDE or GMM-EM, and $f_G$ is the background Gaussian. Auto-routing selects KDE (gap detected, $K \geq 50$) or GMM (small $K$ or smooth distributions).
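To make the likelihood ratio concrete, here is a 1-D sketch that models both $f_R$ and $f_G$ as Gaussians over cosine similarity; the library uses KDE/GMM instead, and all parameters here are illustrative:

```python
import math

def sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def gaussian_pdf(x: float, mu: float, sd: float) -> float:
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def lr_probability(cos_sim, mu_r, sd_r, mu_g, sd_g, base_rate=0.05):
    """P(R|d) = sigma(log(f_R / f_G) + logit(P_base)), with f_R (relevant)
    and f_G (background) modelled as 1-D Gaussians over cosine similarity."""
    llr = math.log(gaussian_pdf(cos_sim, mu_r, sd_r) /
                   gaussian_pdf(cos_sim, mu_g, sd_g))
    return sigmoid(llr + logit(base_rate))
```

Where the two densities are equal the log ratio vanishes and the output falls back to the base rate, which is the calibration property the framework is built around.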

calibrate_with_sample() decouples density estimation from evaluation: the local ANN neighborhood (e.g. IVF probed cells) provides the training sample for $f_R$, while probabilities are produced for an arbitrary evaluation set. This is the index-aware calibration path where the density landscape comes from one set of distances and the output probabilities are needed at different points. Standalone helpers ivf_density_prior() and knn_density_prior() provide optional density priors from IVF cell populations or HNSW neighbor distances.

Benchmarks

Evaluated on 5 BEIR datasets using the retrieve-then-evaluate protocol (top-1000 per signal, union candidates, pytrec_eval). Dense encoder: all-MiniLM-L6-v2. BM25: k1=1.2, b=0.75, Lucene variant with Snowball English stemmer.

NDCG@10 (Zero-Shot)

| Method | ArguAna | FiQA | NFCorpus | SciDocs | SciFact | Average |
|---|---|---|---|---|---|---|
| BM25 | 36.16 | 25.32 | 31.85 | 15.65 | 67.91 | 35.38 |
| Dense | 36.98 | 36.87 | 31.59 | 21.64 | 64.51 | 38.32 |
| Convex | 40.03 | 37.10 | 35.61 | 19.65 | 73.38 | 41.15 |
| RRF | 39.61 | 36.85 | 34.43 | 20.09 | 71.43 | 40.48 |
| Bayesian-Balanced | 37.27 | 40.59 | 35.73 | 21.40 | 72.47 | 41.50 |
| Bayesian-Attn-Norm | 37.21 | 40.43 | 35.43 | 21.91 | 73.22 | 41.64 |
| Bayesian-Vector-Balanced | 37.53 | 40.02 | 35.13 | 21.44 | 70.24 | 40.87 |
| Bayesian-MultiHead-Norm | 37.13 | 39.08 | 35.72 | 21.78 | 70.60 | 40.86 |

Delta vs BM25 (NDCG@10)

| Method | Type | Delta |
|---|---|---|
| Bayesian-Attn-Norm | zero-shot | +6.28 |
| Bayesian-Balanced | zero-shot | +6.12 |
| Convex | zero-shot | +5.78 |
| Bayesian-Vector-Attn | zero-shot | +5.60 |
| Bayesian-Vector-Balanced | zero-shot | +5.49 |
| Bayesian-MultiHead-Norm | zero-shot | +5.48 |
| RRF | zero-shot | +5.10 |
| Bayesian-MultiHead | zero-shot | +5.08 |
| Bayesian-Attention | zero-shot | +4.99 |
| Dense | zero-shot | +2.94 |

Bayesian-Attn-Norm achieves the highest average NDCG@10 (41.67%), outperforming Convex combination (+0.52), RRF (+1.18), and BM25 (+6.28) in a fully zero-shot setting with no relevance labels. Bayesian-Vector-Balanced (+5.49) uses likelihood ratio calibration (Paper 3) for the dense signal, competitive with Convex (+5.78). See the full benchmark tables with 26 methods, MAP@10, and Recall@10.

Probability Calibration

| Method | NFCorpus ECE | SciFact ECE |
|---|---|---|
| Bayesian (no base rate) | 0.6519 | 0.7989 |
| Bayesian (base_rate=auto) | 0.1461 (-77.6%) | 0.2577 (-67.7%) |
| Batch fit + base_rate=auto | 0.0085 (-98.7%) | 0.0021 (-99.7%) |
| Platt scaling | 0.0186 (-97.1%) | 0.0188 (-97.7%) |

The base rate prior alone reduces ECE by 68–77% without any labeled data. With batch fitting, calibration error drops below 1%.
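For reference, ECE can be computed with a simple equal-width binning sketch; 10 bins is a common default, not necessarily what this benchmark used:

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: per-bin |mean predicted prob - empirical accuracy|,
    weighted by the fraction of predictions in each bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p = 1.0 into the top bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)   # mean confidence in bin
        acc = sum(y for _, y in b) / len(b)    # empirical accuracy in bin
        ece += len(b) / n * abs(conf - acc)
    return ece
```

Perfectly calibrated predictions score 0; a model that says 0.9 on examples that are never relevant scores 0.9.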

Reproduce

```shell
# Zero-shot (26 methods, exact dense retrieval)
python benchmarks/hybrid_beir.py -d <beir-data-dir>

# With IVF dense backend (index-aware calibration)
python benchmarks/hybrid_beir.py -d <beir-data-dir> --dense-backend ivf

# With tuning (supervised + grid search)
python benchmarks/hybrid_beir.py -d <beir-data-dir> --tune

# Download BEIR datasets automatically
python benchmarks/hybrid_beir.py -d <beir-data-dir> --download
```

Adoption

MTEB

Included as a baseline retrieval model (bb25) for the Massive Text Embedding Benchmark.

txtai

Used for BM25 score normalization in hybrid search (normalize="bayesian-bm25").

Vespa.ai

Adopted as an official sample application.

UQA

Scoring operator for probabilistic text retrieval and multi-signal fusion in the unified query algebra.

Citation

```bibtex
@preprint{Jeong2026BayesianBM25,
  author    = {Jeong, Jaepil},
  title     = {Bayesian {BM25}: {A} Probabilistic Framework for Hybrid Text
               and Vector Search},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.18414940},
  url       = {https://doi.org/10.5281/zenodo.18414940}
}

@preprint{Jeong2026BayesianNeural,
  author    = {Jeong, Jaepil},
  title     = {From {Bayesian} Inference to Neural Computation: The Analytical
               Emergence of Neural Network Structure from Probabilistic
               Relevance Estimation},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.18512411},
  url       = {https://doi.org/10.5281/zenodo.18512411}
}

@preprint{Jeong2026VectorCalibration,
  author    = {Jeong, Jaepil},
  title     = {Vector Scores as Likelihood Ratios: Index-Derived {Bayesian}
               Calibration for Hybrid Search},
  year      = {2026}
}
```