Converts raw BM25 retrieval scores into calibrated relevance probabilities using Bayesian inference. Probabilistic fusion for hybrid text and vector search.
Standard BM25 produces unbounded scores that lack consistent meaning across queries. A score of 8.5 might be highly relevant for one query but mediocre for another, making threshold-based filtering and multi-signal fusion unreliable. Bayesian BM25 solves this by transforming scores into calibrated probabilities in $[0, 1]$.
The likelihood of relevance given a BM25 score $s$ follows a sigmoid model parameterized by steepness $\alpha$ and midpoint $\beta$:

$$L(s) = \sigma\bigl(\alpha (s - \beta)\bigr) = \frac{1}{1 + e^{-\alpha (s - \beta)}}$$
This maps any real-valued BM25 score $s$ into $(0, 1)$, capturing the intuition that relevance probability increases with score but saturates at the extremes.
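A minimal sketch of this likelihood; the default $\alpha$ and $\beta$ values below are illustrative placeholders, not the library's defaults:

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def bm25_likelihood(score: float, alpha: float = 1.0, beta: float = 5.0) -> float:
    """Sigmoid likelihood: sigma(alpha * (score - beta))."""
    return sigmoid(alpha * (score - beta))
```

A score at the midpoint $\beta$ maps to exactly 0.5, and increasing $\alpha$ sharpens the transition around it.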
A document's prior relevance probability combines two information-theoretic signals: term frequency (higher TF indicates richer topical content) and document length ratio (documents near the corpus average length are more likely to be relevant):

$$\pi = \pi_{tf} \cdot \pi_{len}$$
where $\pi_{tf} = \sigma(\log(1 + tf))$ and $\pi_{len} = 0.5 + 0.5 \cdot (1 - |r - 1|)$ with $r$ being the document length ratio. The composite prior is multiplicative in probability space, which becomes additive in log-odds space.
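A sketch of the composite prior under these definitions. Clamping the triangular length term at zero is a safety assumption added here, since $|r - 1| > 1$ would otherwise push $\pi_{len}$ below zero:

```python
import math

def composite_prior(tf: float, length_ratio: float) -> float:
    """pi = pi_tf * pi_len, multiplicative in probability space."""
    # pi_tf = sigma(log(1 + tf)): saturating term-frequency signal
    pi_tf = 1.0 / (1.0 + math.exp(-math.log1p(tf)))
    # pi_len = 0.5 + 0.5 * (1 - |r - 1|): peaks at the corpus-average length
    # (r = 1); the triangle is clamped at zero so pi_len stays in [0.5, 1]
    # (clamping is an assumption, not from the text)
    pi_len = 0.5 + 0.5 * max(0.0, 1.0 - abs(length_ratio - 1.0))
    return pi_tf * pi_len
```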
The final posterior combines likelihood, prior, and a corpus-level base rate $b_r$ using Bayes' theorem. In log-odds space, the three terms decompose additively:

$$\text{logit}(P) = \text{logit}(L) + \text{logit}(\pi) + \text{logit}(b_r)$$
This three-term decomposition separates the score-dependent signal (likelihood) from document-specific context (prior) and corpus-level prevalence (base rate), enabling each component to be estimated independently. The base rate alone reduces Expected Calibration Error by 68–77% without requiring any relevance labels.
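The decomposition can be sketched directly; with a neutral prior and base rate of 0.5 (zero logit), the posterior reduces to the likelihood:

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def posterior(likelihood: float, prior: float, base_rate: float) -> float:
    """logit(P) = logit(L) + logit(pi) + logit(b_r), mapped back via sigmoid."""
    return sigmoid(logit(likelihood) + logit(prior) + logit(base_rate))
```

A low base rate pulls the posterior down: `posterior(0.9, 0.5, 0.05)` lands well below 0.9, reflecting that most documents in a corpus are irrelevant to any given query.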
Naive probability multiplication ($P = \prod P_i$) suffers from conjunction shrinkage: combining several agreeing high-probability signals yields a fused probability lower than any individual signal. The log-odds conjunction resolves this by averaging in logit space with confidence scaling:

$$P = \sigma\!\left(n^{\alpha} \cdot \frac{1}{n} \sum_{i=1}^{n} \text{logit}(P_i)\right)$$
The $n^{\alpha}$ factor (with $\alpha = 0.5$ giving $\sqrt{n}$ scaling) compensates for the averaging effect so that $n$ perfectly agreeing signals return the consensus probability, not a diluted version. With per-signal weights $w_i$, this becomes the Log-OP formulation $\sigma\!\bigl(n^{\alpha} \sum w_i \cdot \text{logit}(P_i)\bigr)$, where $\alpha$ and weights compose multiplicatively.
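A sketch contrasting naive multiplication with the unweighted log-odds conjunction (uniform weights, so the sum of logits becomes a mean):

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def naive_product(probs):
    """Conjunction shrinkage: the product falls below every input."""
    out = 1.0
    for p in probs:
        out *= p
    return out

def log_odds_conjunction(probs, alpha=0.5):
    """sigma(n^alpha * mean(logit(P_i))): agreement raises confidence."""
    n = len(probs)
    mean_logit = sum(logit(p) for p in probs) / n
    return sigmoid(n ** alpha * mean_logit)
```

For three agreeing signals at 0.9, the naive product collapses to about 0.73 while the log-odds conjunction stays above 0.9.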
```mermaid
graph LR
    S["Raw BM25 Score"] --> SIG["Sigmoid Likelihood<br/>L = sigma(alpha * (s - beta))"]
    SIG --> POST["Bayesian Posterior<br/>logit(P) = logit(L) + logit(pi) + logit(br)"]
    TF["Term Frequency"] --> PRIOR["Composite Prior<br/>pi = pi_tf * pi_len"]
    DL["Doc Length Ratio"] --> PRIOR
    PRIOR --> POST
    BR["Base Rate"] --> POST
    POST --> PROB["Calibrated<br/>Probability"]
    PROB --> FUSE["Fusion<br/>Log-Odds / AND / OR"]
    VEC["Vector Score"] --> COS["cosine_to_probability()"]
    VEC --> VPT["VectorProbabilityTransform<br/>P = sigmoid(log(f_R/f_G) + logit(base))"]
    ANN["ANN Index<br/>(IVF / HNSW)"] -.->|"local sample"| VPT
    COS --> FUSE
    VPT --> FUSE
    FUSE --> RANK["Final<br/>Ranking"]
```
- Sigmoid likelihood converts unbounded BM25 scores to $(0, 1)$ probabilities with learnable steepness and midpoint parameters.
- Term frequency and document length signals provide document-specific relevance context independent of the query score.
- Corpus-level relevance prevalence is estimated from the score distribution (percentile, mixture model, or elbow detection); no labels needed.
- Batch gradient descent or online SGD with EMA-smoothed gradients and Polyak averaging; three training modes: C1, C2, C3.
- Log-odds conjunction, probabilistic AND/OR/NOT, per-signal weights, sparse gating (ReLU/Swish/GELU/Softplus), multi-head attention, and neural score calibration.
- Likelihood ratio framework replaces $\frac{1+\cos\theta}{2}$ with density ratio estimation (KDE/GMM); `calibrate_with_sample()` decouples evaluation from the density sample for index-aware ANN calibration.
- Safe Bayesian probability upper bounds for document pruning in top-k retrieval, block-max (BMW) bounds for tighter per-block pruning, and attention pruning for fusion.
All fusion operations work in probability space with well-defined semantics. Because every signal is a calibrated probability, they compose freely.
| Function | Description | Use Case |
|---|---|---|
| `log_odds_conjunction()` | $\sigma\!\bigl(n^{\alpha}\sum w_i \cdot \text{logit}(P_i)\bigr)$ — agreement-aware fusion | Primary multi-signal fusion |
| `balanced_log_odds_fusion()` | Min-max normalize logits, then combine — equalizes signal scales | Hybrid BM25 + dense search |
| `prob_and()` | $P = \prod P_i$ — product rule in log-space | Strict conjunction (all must match) |
| `prob_or()` | $P = 1 - \prod(1 - P_i)$ — complement rule in log-space | Disjunction (any may match) |
| `prob_not()` | $P = 1 - P_i$ — complement | Exclusion queries |
| `cosine_to_probability()` | $(1 + \cos\theta) / 2$ with epsilon clamping | Convert vector similarity for fusion |
| `VectorProbabilityTransform` | $\sigma(\log(f_R / f_G) + \text{logit}(P_{base}))$ — likelihood ratio calibration | Calibrated vector similarity via density estimation (Paper 3) |
`LearnableLogOddsWeights` learns per-signal reliability from labeled data via a Hebbian gradient that is backprop-free. Starting from the Naive Bayes uniform initialization ($w_i = 1/n$), the gradient $\nabla_{z_j} = n^{\alpha}(p - y) \cdot w_j(x_j - \bar{x}_w)$ adjusts weights based on pre-synaptic activity times post-synaptic error.
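A sketch of one update step using the gradient above. The softmax parameterization of weights over logits $z$ and the learning rate are assumptions; the gradient itself is the formula from the text:

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def hebbian_step(probs, y, z, alpha=0.5, lr=0.1):
    """One backprop-free update of per-signal reliability logits z.
    Gradient (from the text): n^alpha * (p - y) * w_j * (x_j - xbar_w)."""
    n = len(probs)
    x = [logit(p) for p in probs]
    # Softmax over z yields weights summing to 1; z = 0 gives the
    # Naive Bayes uniform initialization w_i = 1/n (parameterization assumed)
    m = max(z)
    exps = [math.exp(zj - m) for zj in z]
    total = sum(exps)
    w = [e / total for e in exps]
    xbar = sum(wj * xj for wj, xj in zip(w, x))   # weighted mean logit
    p = sigmoid(n ** alpha * xbar)                # fused Log-OP prediction
    grads = [n ** alpha * (p - y) * wj * (xj - xbar) for wj, xj in zip(w, x)]
    return [zj - lr * g for zj, g in zip(z, grads)], p
```

With a positive label and two disagreeing signals, one step raises the reliability logit of the stronger signal and lowers the weaker one.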
`AttentionLogOddsWeights` replaces static weights with query-dependent attention. A linear projection from query features to softmax attention weights allows the fusion to adapt per query: some queries benefit more from lexical signals, others from semantic similarity. Optional per-signal logit normalization (`normalize=True`) equalizes signal scales before the weighted sum.
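The query-dependent weighting can be sketched as a linear projection followed by softmax; the projection matrix here is a hypothetical learned parameter:

```python
import math

def softmax(xs):
    """Stable softmax: shift by the max before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query_features, projection):
    """Per-query signal weights: softmax over a linear projection of query
    features. Each row of `projection` scores one signal."""
    scores = [sum(w * f for w, f in zip(row, query_features))
              for row in projection]
    return softmax(scores)
```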
For high-dimensional signal spaces, ReLU gating
($\max(0, \text{logit})$, MAP estimation), Swish gating
($\text{logit} \cdot \sigma(\text{logit})$, Bayes estimation),
and GELU gating
($\text{logit} \cdot \sigma(1.702 \cdot \text{logit})$, Gaussian noise model)
suppress noisy negative-logit signals before aggregation
(Paper 2, Theorems 6.5.3/6.7.4/6.8.1).
Softplus gating
($\log(1 + e^{\beta \cdot \text{logit}}) / \beta$, Remark 6.5.4)
is a smooth ReLU that never zeroes out evidence, making it suitable for
small datasets where discarding any signal is costly.
The generalized Swish gate
$\text{logit} \cdot \sigma(\beta \cdot \text{logit})$
interpolates between $x/2$ ($\beta \to 0$), standard Swish ($\beta = 1$),
and ReLU ($\beta \to \infty$) via the `gating_beta` parameter
(Theorem 6.7.6).
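The four gates, sketched directly from the formulas above ($\beta$ defaults to 1):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def relu_gate(t: float) -> float:
    """Hard threshold (MAP estimation): negative-logit evidence is discarded."""
    return max(0.0, t)

def swish_gate(t: float, beta: float = 1.0) -> float:
    """Generalized Swish t * sigma(beta*t): x/2 as beta -> 0, ReLU as beta -> inf."""
    return t * sigmoid(beta * t)

def gelu_gate(t: float) -> float:
    """GELU via the sigmoid approximation t * sigma(1.702 * t)."""
    return t * sigmoid(1.702 * t)

def softplus_gate(t: float, beta: float = 1.0) -> float:
    """Smooth ReLU log(1 + e^(beta*t)) / beta: never zeroes out evidence."""
    return math.log1p(math.exp(beta * t)) / beta
```

Unlike ReLU, Softplus leaves a small positive residue for negative logits, which is why it suits small datasets where no evidence should be discarded outright.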
`MultiHeadAttentionLogOddsWeights` creates multiple independent
attention heads with different random initializations. Each head produces
fused log-odds independently, then the results are averaged in log-odds space
before converting back to probability via sigmoid (Remark 8.6). Multi-head
diversity reduces variance compared to single-head attention.
Both single-head and multi-head attention support exact pruning via `compute_upper_bounds()` and `prune()` (Theorem 8.7.1).
`PlattCalibrator` (sigmoid: $P = \sigma(a \cdot s + b)$) and `IsotonicCalibrator` (PAVA monotone regression) convert raw neural model scores into calibrated probabilities suitable for Bayesian fusion. Calibrated scores can be combined with BM25 probabilities via `log_odds_conjunction()`.
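A minimal sketch of the sigmoid fit; the actual `PlattCalibrator` may use a different optimizer, and the learning rate and epoch count here are illustrative:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def fit_platt(scores, labels, lr=0.1, epochs=500):
    """Fit P = sigma(a*s + b) by gradient descent on the log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid(a * s + b)
            grad_a += (p - y) * s / n   # d(logloss)/da
            grad_b += (p - y) / n       # d(logloss)/db
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b
```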
`VectorProbabilityTransform` replaces the naive
$\frac{1+\cos\theta}{2}$ conversion with a likelihood ratio framework (Paper 3):
$P(R|d) = \sigma(\log(f_R(d) / f_G(d)) + \text{logit}(P_{base}))$,
where $f_R$ is estimated via weighted KDE or GMM-EM, and $f_G$ is the
background Gaussian. Auto-routing selects KDE (gap detected, $K \geq 50$)
or GMM (small $K$ or smooth distributions).
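To make the transform concrete, here is a sketch with single Gaussians standing in for the KDE/GMM density estimates ($f_R$ over scores near relevant neighbors, $f_G$ the background); all parameter values are illustrative:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def likelihood_ratio_prob(score, rel_mu, rel_var, bg_mu, bg_var, base_rate=0.05):
    """P(R|d) = sigma(log(f_R / f_G) + logit(P_base)), with Gaussian
    stand-ins for the fitted densities."""
    log_ratio = (math.log(gaussian_pdf(score, rel_mu, rel_var))
                 - math.log(gaussian_pdf(score, bg_mu, bg_var)))
    return sigmoid(log_ratio + logit(base_rate))
```

Scores near the relevant-density mode get boosted above the base rate; scores deep in the background mass are suppressed toward zero.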
`calibrate_with_sample()` decouples density estimation from
evaluation: the local ANN neighborhood (e.g. IVF probed cells) provides
the training sample for $f_R$, while probabilities are produced for an
arbitrary evaluation set. This is the index-aware calibration path where
the density landscape comes from one set of distances and the output
probabilities are needed at different points.
Standalone helpers `ivf_density_prior()` and `knn_density_prior()` provide optional density priors from IVF cell populations or HNSW neighbor distances.
Evaluated on 5 BEIR datasets using the retrieve-then-evaluate protocol (top-1000 per signal, union candidates, pytrec_eval). Dense encoder: all-MiniLM-L6-v2. BM25: k1=1.2, b=0.75, Lucene variant with Snowball English stemmer.
| Method | ArguAna | FiQA | NFCorpus | SciDocs | SciFact | Average |
|---|---|---|---|---|---|---|
| BM25 | 36.16 | 25.32 | 31.85 | 15.65 | 67.91 | 35.38 |
| Dense | 36.98 | 36.87 | 31.59 | 21.64 | 64.51 | 38.32 |
| Convex | 40.03 | 37.10 | 35.61 | 19.65 | 73.38 | 41.15 |
| RRF | 39.61 | 36.85 | 34.43 | 20.09 | 71.43 | 40.48 |
| Bayesian-Balanced | 37.27 | 40.59 | 35.73 | 21.40 | 72.47 | 41.50 |
| Bayesian-Attn-Norm | 37.21 | 40.43 | 35.43 | 21.91 | 73.22 | 41.64 |
| Bayesian-Vector-Balanced | 37.53 | 40.02 | 35.13 | 21.44 | 70.24 | 40.87 |
| Bayesian-MultiHead-Norm | 37.13 | 39.08 | 35.72 | 21.78 | 70.60 | 40.86 |
| Method | Type | Δ NDCG@10 vs BM25 |
|---|---|---|
| Bayesian-Attn-Norm | zero-shot | +6.28 |
| Bayesian-Balanced | zero-shot | +6.12 |
| Convex | zero-shot | +5.78 |
| Bayesian-Vector-Attn | zero-shot | +5.60 |
| Bayesian-Vector-Balanced | zero-shot | +5.49 |
| Bayesian-MultiHead-Norm | zero-shot | +5.48 |
| RRF | zero-shot | +5.10 |
| Bayesian-MultiHead | zero-shot | +5.08 |
| Bayesian-Attention | zero-shot | +4.99 |
| Dense | zero-shot | +2.94 |
Bayesian-Attn-Norm achieves the highest average NDCG@10 (41.67%), outperforming Convex combination (+0.52), RRF (+1.18), and BM25 (+6.28) in a fully zero-shot setting with no relevance labels. Bayesian-Vector-Balanced (+5.49) uses likelihood ratio calibration (Paper 3) for the dense signal, competitive with Convex (+5.78). See the full benchmark tables with 26 methods, MAP@10, and Recall@10.
| Method | NFCorpus ECE | SciFact ECE |
|---|---|---|
| Bayesian (no base rate) | 0.6519 | 0.7989 |
| Bayesian (base_rate=auto) | 0.1461 (-77.6%) | 0.2577 (-67.7%) |
| Batch fit + base_rate=auto | 0.0085 (-98.7%) | 0.0021 (-99.7%) |
| Platt scaling | 0.0186 (-97.1%) | 0.0188 (-97.7%) |
The base rate prior alone reduces ECE by 68–77% without any labeled data. With batch fitting, calibration error drops below 1%.
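For reference, the ECE metric itself is simple to compute; this equal-width 10-bin version is a common convention, though the benchmark's exact binning scheme is not specified here:

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width binning ECE: sum over bins of (bin size / N) * |acc - conf|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes in the top bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)   # mean predicted probability
        acc = sum(y for _, y in b) / len(b)    # empirical relevance rate
        ece += len(b) / n * abs(acc - conf)
    return ece
```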
```bash
# Zero-shot (26 methods, exact dense retrieval)
python benchmarks/hybrid_beir.py -d <beir-data-dir>

# With IVF dense backend (index-aware calibration)
python benchmarks/hybrid_beir.py -d <beir-data-dir> --dense-backend ivf

# With tuning (supervised + grid search)
python benchmarks/hybrid_beir.py -d <beir-data-dir> --tune

# Download BEIR datasets automatically
python benchmarks/hybrid_beir.py -d <beir-data-dir> --download
```
- Included as a baseline retrieval model (`bb25`) for the Massive Text Embedding Benchmark.
- Used for BM25 score normalization in hybrid search (`normalize="bayesian-bm25"`).
- Adopted as an official sample application.
- Scoring operator for probabilistic text retrieval and multi-signal fusion in the unified query algebra.
```bibtex
@preprint{Jeong2026BayesianBM25,
  author    = {Jeong, Jaepil},
  title     = {Bayesian {BM25}: {A} Probabilistic Framework for Hybrid Text
               and Vector Search},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.18414940},
  url       = {https://doi.org/10.5281/zenodo.18414940}
}

@preprint{Jeong2026BayesianNeural,
  author    = {Jeong, Jaepil},
  title     = {From {Bayesian} Inference to Neural Computation: The Analytical
               Emergence of Neural Network Structure from Probabilistic
               Relevance Estimation},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.18512411},
  url       = {https://doi.org/10.5281/zenodo.18512411}
}

@preprint{Jeong2026VectorCalibration,
  author = {Jeong, Jaepil},
  title  = {Vector Scores as Likelihood Ratios: Index-Derived {Bayesian}
            Calibration for Hybrid Search},
  year   = {2026}
}
```