## Score to Probability

```bash
pip install bayesian-bm25
```

Convert raw BM25 scores into calibrated relevance probabilities via the full Bayesian pipeline (sigmoid likelihood + composite prior + posterior).

```python
import numpy as np
from bayesian_bm25 import BayesianProbabilityTransform
transform = BayesianProbabilityTransform(alpha=1.5, beta=1.0, base_rate=0.01)
scores = np.array([0.5, 1.0, 1.5, 2.0, 3.0])
tfs = np.array([1, 2, 3, 5, 8])
doc_len_ratios = np.array([0.3, 0.5, 0.8, 1.0, 1.5])
probabilities = transform.score_to_probability(scores, tfs, doc_len_ratios)
```
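For intuition, here is a minimal numpy sketch of the underlying update, assuming a Platt-style sigmoid likelihood and a plain Bayes step over the base rate. `sketch_posterior` is a hypothetical helper, and the library's composite prior additionally folds in `tf` and document-length evidence, so exact outputs will differ:

```python
import numpy as np

def sketch_posterior(score, alpha=1.5, beta=1.0, base_rate=0.01):
    # Assumed Platt-style likelihood: sigmoid(alpha * score + beta)
    likelihood = 1.0 / (1.0 + np.exp(-(alpha * score + beta)))
    # Plain Bayes update against the corpus base rate
    num = base_rate * likelihood
    return num / (num + (1.0 - base_rate) * (1.0 - likelihood))

print(sketch_posterior(np.array([0.5, 1.0, 1.5, 2.0, 3.0])))
```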
## End-to-End Search

Drop-in scorer wrapping `bm25s` that returns calibrated probabilities instead of raw scores.

```python
from bayesian_bm25 import BayesianBM25Scorer

corpus_tokens = [
    ["python", "machine", "learning"],
    ["deep", "learning", "neural", "networks"],
    ["data", "visualization", "tools"],
]

scorer = BayesianBM25Scorer(k1=1.2, b=0.75, method="lucene", base_rate="auto")
scorer.index(corpus_tokens, show_progress=False)

doc_ids, probabilities = scorer.retrieve([["machine", "learning"]], k=3)
```
## Multi-Field Search

Separate BM25 indexes per field with automatic fusion via log-odds conjunction.

```python
from bayesian_bm25 import MultiFieldScorer

documents = [
    {"title": ["bayesian", "bm25"], "body": ["probabilistic", "framework", "search"]},
    {"title": ["neural", "networks"], "body": ["deep", "learning", "models"]},
    {"title": ["information", "retrieval"], "body": ["search", "ranking", "relevance"]},
]

scorer = MultiFieldScorer(
    fields=["title", "body"],
    field_weights={"title": 0.4, "body": 0.6},
    k1=1.2, b=0.75, method="lucene",
)
scorer.index(documents, show_progress=False)

doc_ids, probabilities = scorer.retrieve(["bayesian", "search"], k=3)
```
## Signal Fusion

Combine multiple probability signals with Boolean and log-odds operations.

```python
import numpy as np
from bayesian_bm25 import log_odds_conjunction, prob_and, prob_not, prob_or
signals = np.array([0.85, 0.70, 0.60])
prob_and(signals)             # 0.357 (shrinkage problem)
log_odds_conjunction(signals) # 0.773 (agreement-aware)
# Exclusion query: "python AND NOT java"
p_python, p_java = 0.90, 0.75
prob_and(np.array([p_python, prob_not(p_java)]))  # 0.225
```
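The Boolean operators reduce to elementary probability arithmetic, so the printed values can be checked by hand (assuming `prob_and` is the independent product and `prob_not` the complement, which the outputs above are consistent with):

```python
import numpy as np

signals = np.array([0.85, 0.70, 0.60])
assert np.isclose(np.prod(signals), 0.357)   # AND as independent product
assert np.isclose(0.90 * (1 - 0.75), 0.225)  # "python AND NOT java"
```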
## Hybrid Text + Vector Search

Fuse BM25 probabilities with dense vector similarity scores.

```python
import numpy as np
from bayesian_bm25 import cosine_to_probability, log_odds_conjunction
# BM25 probabilities (from Bayesian BM25)
bm25_probs = np.array([0.85, 0.60, 0.40])
# Vector search cosine similarities -> probabilities
cosine_scores = np.array([0.92, 0.35, 0.70])
vector_probs = cosine_to_probability(cosine_scores) # [0.96, 0.675, 0.85]
# Fuse with reliability weights (BM25 weight=0.6, vector weight=0.4)
stacked = np.stack([bm25_probs, vector_probs], axis=-1)
fused = log_odds_conjunction(stacked, weights=np.array([0.6, 0.4]))
# Fuse with weights and confidence scaling (alpha + weights compose)
fused = log_odds_conjunction(stacked, alpha=0.5, weights=np.array([0.6, 0.4]))
# Gated fusion: ReLU/Swish activation in logit space
fused_relu = log_odds_conjunction(stacked, gating="relu") # MAP estimation
fused_swish = log_odds_conjunction(stacked, gating="swish") # Bayes estimation
fused_gelu = log_odds_conjunction(stacked, gating="gelu") # Gaussian noise model
fused_softplus = log_odds_conjunction(stacked, gating="softplus") # evidence-preserving
# Generalized beta controls gate sharpness (Theorem 6.7.6)
fused_soft = log_odds_conjunction(stacked, gating="swish", gating_beta=0.5)
```
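The mapped values printed above are consistent with the affine map p = (1 + cos) / 2; a quick standalone check (this is an observation about the example output, not a documented contract):

```python
import numpy as np

cosine_scores = np.array([0.92, 0.35, 0.70])
print((1 + cosine_scores) / 2)  # [0.96, 0.675, 0.85], matching vector_probs above
```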
## Vector Score Calibration

Calibrate vector distances into probabilities via a likelihood-ratio framework. `calibrate()` uses the same distances for density estimation and evaluation; `calibrate_with_sample()` decouples the two for index-aware ANN calibration.

```python
import numpy as np
from bayesian_bm25 import VectorProbabilityTransform, ivf_density_prior
# Estimate background distribution from corpus distances
corpus_distances = np.random.normal(0.8, 0.15, size=10000)
vpt = VectorProbabilityTransform.fit_background(corpus_distances, base_rate=0.01)
# Basic calibration: same distances for density estimation and evaluation
query_distances = np.array([0.3, 0.5, 0.7, 0.9, 1.1])
probabilities = vpt.calibrate(query_distances)
# With BM25 probability weights for informed density estimation
bm25_probs = np.array([0.85, 0.60, 0.40, 0.20, 0.10])
probabilities = vpt.calibrate(query_distances, weights=bm25_probs)
# Index-aware calibration: density from local ANN sample,
# probabilities for a separate evaluation set
sample_distances = np.array([0.10, 0.15, 0.20, 0.50, 0.75, 0.80, 0.85])
eval_distances = np.array([0.12, 0.30, 0.70])
probabilities = vpt.calibrate_with_sample(
    eval_distances, sample_distances,
    weights=bm25_probs[:3],
)

# IVF density prior: denser cells suggest more relevant neighborhoods
cell_prior = ivf_density_prior(cell_population=150, avg_population=100)
probabilities = vpt.calibrate_with_sample(
    eval_distances, sample_distances,
    density_prior=np.full(7, cell_prior),
)
```
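Underneath, likelihood-ratio calibration is plain Bayes: with base rate π and likelihood ratio LR(d) = p(d | relevant) / p(d | background), the posterior is π·LR / (π·LR + 1 − π). A generic check of that identity (the library estimates the two densities from the distances you supply; `posterior_from_lr` is illustrative):

```python
import numpy as np

def posterior_from_lr(lr, base_rate=0.01):
    # Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio
    odds = (base_rate / (1.0 - base_rate)) * lr
    return odds / (1.0 + odds)

print(posterior_from_lr(np.array([1.0, 10.0, 100.0])))  # LR = 1 recovers the base rate
```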
## Learnable Weights

Learn per-signal reliability from labeled data with Hebbian gradients.

```python
import numpy as np
from bayesian_bm25 import LearnableLogOddsWeights
# 3 retrieval signals: BM25, vector search, metadata match
learner = LearnableLogOddsWeights(n_signals=3, alpha=0.0)
# Initial weights are uniform: [0.333, 0.333, 0.333]
# Batch fit from labeled data (probs: m x 3, labels: m)
learner.fit(training_probs, training_labels, learning_rate=0.1)
# Learned weights reflect signal reliability: [0.70, 0.19, 0.11]
# Online refinement from streaming feedback
for probs, label in feedback_stream:
    learner.update(probs, label, learning_rate=0.05, momentum=0.9)
# Inference with Polyak-averaged weights for stability
fused = learner(test_probs, use_averaged=True)
```
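The snippet leaves `training_probs`, `training_labels`, and the feedback stream abstract. One hypothetical way to synthesize them and run it end to end, with signal 0 deliberately made most reliable:

```python
import numpy as np
from bayesian_bm25 import LearnableLogOddsWeights

rng = np.random.default_rng(0)
m = 500
training_labels = rng.integers(0, 2, size=m).astype(float)
# Mix label signal with noise: signal 0 is cleanest, signal 2 noisiest
reliability = np.array([0.8, 0.5, 0.3])
noise = rng.uniform(size=(m, 3))
training_probs = np.clip(
    training_labels[:, None] * reliability + noise * (1 - reliability), 0.01, 0.99
)

learner = LearnableLogOddsWeights(n_signals=3, alpha=0.0)
learner.fit(training_probs, training_labels, learning_rate=0.1)
test_probs = training_probs[:5]
fused = learner(test_probs, use_averaged=True)
```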
## Attention-Based Fusion

Query-dependent signal weights via an attention mechanism with optional per-signal logit normalization.

```python
import numpy as np
from bayesian_bm25 import AttentionLogOddsWeights

# 2 retrieval signals, 3 query features, per-signal logit normalization
attn = AttentionLogOddsWeights(
    n_signals=2, n_query_features=3, alpha=0.5, normalize=True,
)

# Train on labeled data with query features
# training_probs: (m, 2), training_labels: (m,), query_features: (m, 3)
attn.fit(training_probs, training_labels, query_features,
         learning_rate=0.01, max_iterations=500)

# Query-dependent fusion: weights adapt per query
fused = attn(test_probs, test_features, use_averaged=True)
```
## Multi-Head Attention Fusion

Multiple attention heads with pruning for efficient re-ranking.

```python
import numpy as np
from bayesian_bm25 import MultiHeadAttentionLogOddsWeights

# 4 heads, 2 signals, 3 query features
mh = MultiHeadAttentionLogOddsWeights(
    n_heads=4, n_signals=2, n_query_features=3, alpha=0.5,
)

# Train all heads (different init -> different learned patterns)
mh.fit(training_probs, training_labels, query_features,
       learning_rate=0.01, max_iterations=500)

# Inference: average log-odds across heads, then sigmoid
fused = mh(test_probs, test_features, use_averaged=True)

# Attention pruning: safely eliminate low-probability candidates
surviving_idx, fused_probs = mh.prune(
    candidate_probs, query_features, threshold=0.5,
    upper_bound_probs=candidate_upper_bounds,
)
```
## Gating Functions

GELU, Softplus, and generalized Swish gating for noisy multi-signal fusion.

```python
import numpy as np
from bayesian_bm25 import log_odds_conjunction

signals = np.array([0.9, 0.3, 0.7])

# Compare gating functions
none = log_odds_conjunction(signals, gating="none")      # no gating
relu = log_odds_conjunction(signals, gating="relu")      # MAP estimation
swish = log_odds_conjunction(signals, gating="swish")    # Bayes estimation
gelu = log_odds_conjunction(signals, gating="gelu")      # Gaussian noise model
sp = log_odds_conjunction(signals, gating="softplus")    # evidence-preserving

# Generalized swish: beta controls gate sharpness
# beta -> 0: x/2 (soft), beta = 1: standard swish, beta -> inf: ReLU
soft = log_odds_conjunction(signals, gating="swish", gating_beta=0.5)
hard = log_odds_conjunction(signals, gating="swish", gating_beta=5.0)

# GELU = Swish with beta = 1.702
gelu_equiv = log_odds_conjunction(signals, gating="swish", gating_beta=1.702)

# Softplus for small datasets: preserves all evidence (Remark 6.5.4)
# softplus(x) > x for all finite x, so use lower alpha to compensate
sp_gentle = log_odds_conjunction(signals, gating="softplus", alpha=0.3)
```
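The gates themselves are standard neural activations applied in logit space, so the identities quoted in the comments can be checked with plain numpy, independently of the library:

```python
import numpy as np

def swish(x, beta=1.0):
    # x * sigmoid(beta * x); errstate guards the large-beta limit check
    with np.errstate(over="ignore"):
        return x / (1.0 + np.exp(-beta * x))

x = np.linspace(-4, 4, 9)
# beta -> 0 collapses to x / 2; beta -> inf approaches ReLU
assert np.allclose(swish(x, beta=1e-8), x / 2, atol=1e-6)
assert np.allclose(swish(x, beta=1e8), np.maximum(x, 0), atol=1e-6)
# GELU is well approximated by swish with beta = 1.702
gelu_approx = swish(x, beta=1.702)
# softplus(x) = log(1 + e^x) > x for every finite x
assert np.all(np.log1p(np.exp(x)) > x)
```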
## Neural Score Calibration

Calibrate neural reranker scores into probabilities for Bayesian fusion.

```python
import numpy as np
from bayesian_bm25 import log_odds_conjunction
from bayesian_bm25.calibration import PlattCalibrator, IsotonicCalibrator
# Platt scaling: P = sigmoid(a * score + b)
platt = PlattCalibrator()
platt.fit(neural_scores, labels, learning_rate=0.01, max_iterations=1000)
calibrated = platt.calibrate(new_scores) # output in (0, 1)
# Isotonic regression: non-parametric monotone mapping via PAVA
iso = IsotonicCalibrator()
iso.fit(neural_scores, labels)
calibrated = iso.calibrate(new_scores)
# Combine calibrated neural scores with BM25 probabilities
stacked = np.stack([bm25_probs, calibrated], axis=-1)
fused = log_odds_conjunction(stacked)
```
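To run the snippet end to end, the placeholder arrays can be synthesized. A hypothetical example with scores whose relevance odds actually follow a sigmoid, so Platt scaling should recover a good fit:

```python
import numpy as np
from bayesian_bm25.calibration import PlattCalibrator

rng = np.random.default_rng(0)
neural_scores = rng.normal(0.0, 2.0, size=200)
# Synthetic labels: higher scores are more likely to be relevant
labels = (rng.uniform(size=200) < 1 / (1 + np.exp(-neural_scores))).astype(float)

platt = PlattCalibrator()
platt.fit(neural_scores, labels, learning_rate=0.01, max_iterations=1000)
print(platt.calibrate(np.array([-2.0, 0.0, 2.0])))  # increasing probabilities
```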
## Temporal Adaptation

Adapt to changing relevance patterns over time with exponential decay.

```python
from bayesian_bm25.probability import TemporalBayesianTransform

# Short half-life: adapt quickly to changing patterns
transform = TemporalBayesianTransform(
    alpha=1.0, beta=0.0, decay_half_life=100.0,
)

# Batch fit with timestamps: recent data gets more weight
transform.fit(scores, labels, timestamps=timestamps)

# Online update: timestamp auto-increments
for score, label in feedback_stream:
    transform.update(score, label)
```
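With exponential decay, a sample's weight halves every `decay_half_life` time units, so stale feedback fades smoothly. The weighting arithmetic (standard half-life math, independent of library internals):

```python
import numpy as np

def decay_weight(age, half_life=100.0):
    # Weight of a sample observed `age` time units ago
    return 0.5 ** (age / half_life)

print(decay_weight(np.array([0.0, 100.0, 200.0, 500.0])))
# [1.0, 0.5, 0.25, 0.03125]: older feedback contributes exponentially less
```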
## WAND Pruning

Compute safe Bayesian probability upper bounds for efficient top-k retrieval with document pruning.

```python
from bayesian_bm25 import BayesianProbabilityTransform
transform = BayesianProbabilityTransform(alpha=1.5, beta=2.0, base_rate=0.01)
# Standard BM25 upper bound per query term
bm25_upper_bound = 5.0
# Bayesian upper bound for safe pruning: any document's actual
# probability is guaranteed to be at most this value
bayesian_bound = transform.wand_upper_bound(bm25_upper_bound)
```
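In a WAND-style loop the bound lets the engine skip work that cannot change the top k. A simplified sketch of that check, where `kth_best` and the per-term BM25 upper bounds are hypothetical and only `wand_upper_bound` comes from the library:

```python
from bayesian_bm25 import BayesianProbabilityTransform

transform = BayesianProbabilityTransform(alpha=1.5, beta=2.0, base_rate=0.01)

kth_best = 0.42  # hypothetical probability of the current k-th ranked document
for term_upper in [5.0, 3.2, 1.1]:  # hypothetical per-term BM25 upper bounds
    if transform.wand_upper_bound(term_upper) < kth_best:
        continue  # documents matching only this term cannot reach the threshold
    # ...otherwise evaluate the term's posting list...
```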
## Debugging the Fusion Pipeline

Trace every intermediate value through the full pipeline for transparent inspection, document comparison, and crossover detection.

```python
from bayesian_bm25 import BayesianProbabilityTransform
from bayesian_bm25.debug import FusionDebugger
transform = BayesianProbabilityTransform(alpha=0.45, beta=6.10, base_rate=0.02)
debugger = FusionDebugger(transform)
# Trace a single document through the full pipeline
trace = debugger.trace_document(
    bm25_score=8.42, tf=5, doc_len_ratio=0.60,
    cosine_score=0.74, doc_id="doc-42",
)
print(debugger.format_trace(trace))
# Compare two documents to see which signal drove the rank difference
trace_a = debugger.trace_document(bm25_score=8.42, tf=5, doc_len_ratio=0.60, cosine_score=0.74)
trace_b = debugger.trace_document(bm25_score=5.10, tf=2, doc_len_ratio=1.20, cosine_score=0.88)
comparison = debugger.compare(trace_a, trace_b)
print(debugger.format_comparison(comparison))
# Hierarchical fusion: AND(OR(title, body), vector, NOT(spam))
step1 = debugger.trace_fusion([0.85, 0.70], names=["title", "body"], method="prob_or")
step2 = debugger.trace_not(0.90, name="spam")
step3 = debugger.trace_fusion(
    [step1.fused_probability, 0.80, step2.complement],
    names=["OR(title,body)", "vector", "NOT(spam)"],
    method="prob_and",
)
```
## Evaluating Calibration Quality

Measure how well the output probabilities match actual relevance rates.

```python
import numpy as np
from bayesian_bm25 import (
    expected_calibration_error, brier_score, reliability_diagram, calibration_report,
)
probabilities = np.array([0.9, 0.8, 0.3, 0.1, 0.7, 0.2])
labels = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
ece = expected_calibration_error(probabilities, labels) # lower is better
bs = brier_score(probabilities, labels) # lower is better
bins = reliability_diagram(probabilities, labels, n_bins=5) # (avg_pred, avg_actual, count)
# One-call diagnostic report
report = calibration_report(probabilities, labels)
print(report.summary())  # formatted text with ECE, Brier, and reliability table
```
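Brier score is just the mean squared error between predicted probabilities and binary labels, so the toy example above can be verified directly (ECE additionally depends on the binning scheme):

```python
import numpy as np

probabilities = np.array([0.9, 0.8, 0.3, 0.1, 0.7, 0.2])
labels = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
# Mean squared error between probabilities and labels
print(np.mean((probabilities - labels) ** 2))  # 0.0467 for this toy data
```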
## Online Learning from User Feedback

Refine parameters from streaming feedback with EMA-smoothed SGD and Polyak averaging.

```python
from bayesian_bm25 import BayesianProbabilityTransform

transform = BayesianProbabilityTransform(alpha=1.0, beta=0.0)

# Batch warmup on historical data
transform.fit(historical_scores, historical_labels)

# Online refinement from live feedback
for score, label in feedback_stream:
    transform.update(score, label, learning_rate=0.01, momentum=0.95)

# Use Polyak-averaged parameters for stable inference
alpha = transform.averaged_alpha
beta = transform.averaged_beta
```
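Polyak averaging maintains a running mean of the parameter iterates, which damps SGD noise at inference time. A generic illustration of the averaging itself (not the library's internal update):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, theta_avg = 1.0, 1.0
for t, noisy_grad in enumerate(rng.normal(0.1, 1.0, size=1000), start=1):
    theta -= 0.01 * noisy_grad            # raw SGD step on a noisy gradient
    theta_avg += (theta - theta_avg) / t  # running (Polyak) average of iterates
# theta_avg tracks the trajectory's mean and fluctuates far less than theta
```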
## Training Modes

Three modes control how gradients flow through the Bayesian pipeline.

```python
from bayesian_bm25 import BayesianProbabilityTransform
transform = BayesianProbabilityTransform(alpha=1.0, beta=0.0)
# C1 (balanced, default): train on sigmoid likelihood
transform.fit(scores, labels, mode="balanced")
# C2 (prior-aware): train on full Bayesian posterior
transform.fit(scores, labels, mode="prior_aware", tfs=tfs, doc_len_ratios=ratios)
# C3 (prior-free): train on likelihood, inference uses prior=0.5
transform.fit(scores, labels, mode="prior_free")
## Source Files

Complete runnable scripts are in the `examples/` directory:

| File | Description |
|---|---|
| `basic_probability.py` | Simple score-to-probability conversion |
| `search_and_retrieve.py` | End-to-end retrieval workflow |
| `multi_field_search.py` | Title + body field indexing and fusion |
| `score_fusion.py` | Combining BM25 + vector probabilities |
| `learnable_fusion.py` | Batch fit + online update of weights |
| `online_learning.py` | Streaming feedback refinement |
| `threshold_filtering.py` | Probability-based filtering workflows |
| `boolean_not.py` | Exclusion queries ("python AND NOT java") |
| `fusion_debugger.py` | 12 examples of pipeline debugging and tracing |
| `gating_functions.py` | GELU/Softplus gating, generalized beta, noise filtering, small-dataset demo |
| `neural_calibration.py` | Platt and isotonic calibration for neural rerankers |
| `temporal_adaptation.py` | Concept-drift detection and half-life tuning |
| `multi_head_fusion.py` | Multi-head attention fusion with pruning |
| `live_ranking.py` | Online learning rank swaps with simulated editorial feedback |