Relvia Labs

Benchmarks

Public benchmarks comparing Relvia against baseline AI research tools on grounding, calibration, and reliability. The methodology is open, and the full suite is rerun on every release.

Eval set: 2,840 queries
Domains: 6 verticals
Models tested: 4 frontier models
Last run: 2026-04-22


Metric                          Relvia    Baseline A   Baseline B
Source-grounded accuracy        94.2%     71.8%        63.4%
Hallucination rate              0.6%      5.4%         8.9%
Confidence calibration (ECE)    0.041     0.182        0.214
Cross-model agreement           0.91      0.62         0.55
Source diversity                5.4       1.7          1.2
Repeatability                   0.96      0.74         0.68

  • Source-grounded accuracy: claims supported by retrievable, attributable sources.
  • Hallucination rate: unverifiable or fabricated claims per 100 outputs.
  • Confidence calibration (ECE): expected calibration error; lower is better.
  • Cross-model agreement: stability of conclusions across 4 leading models.
  • Source diversity: average independent sources per non-trivial claim.
  • Repeatability: conclusion stability across repeated runs (24h).

Baselines anonymized as A and B. Methodology, eval set, and raw results are available on request to qualified research partners. Numbers shown are means across 3 independent runs.
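As an aid to reading the Confidence calibration (ECE) row, the sketch below is a minimal illustration of expected calibration error, assuming the common 10-bin, equal-width formulation. The exact binning behind the published numbers is not specified on this page.

```python
# Illustrative sketch of expected calibration error (ECE), the quantity in the
# "Confidence calibration" row. Assumes a 10-bin, equal-width scheme; the
# binning actually used for the published numbers is not stated on this page.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # First bin is closed on both ends so a confidence of exactly 0.0 is not dropped.
        mask = (confidences >= lo if i == 0 else confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(confidences[mask].mean() - correct[mask].mean())
        ece += (mask.sum() / confidences.size) * gap
    return ece

# A perfectly calibrated system scores 0.0; lower is better.
print(round(expected_calibration_error([0.9, 0.8, 0.95, 0.6], [1, 1, 1, 0]), 3))
```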

Methodology

Built to be rerun, not curated.

Every benchmark in this table is reproducible. Pipelines are versioned, prompts are pinned, and evaluations are rerun on every Relvia release.

  • Independent eval set

    2,840 queries across finance, healthcare, market research, policy, technology, and operations.

  • Versioned pipelines

    Each Relvia release is benchmarked against the previous release on the same eval set.

  • Cross-model coverage

    Outputs are compared across four frontier models to detect provider-specific failure modes; a rough sketch of one way to score that agreement follows this list.

  • Open audit

    Qualified research partners can request raw outputs, source graphs, and run logs.
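As a rough illustration of the cross-model agreement scoring mentioned above: the exact rule behind the table's 0.91 / 0.62 / 0.55 figures is not published on this page, so the sketch below assumes a simple mean pairwise agreement over per-query conclusions from the four models. Model names and conclusion labels are placeholders.

```python
# Illustrative sketch of a cross-model agreement score: mean pairwise agreement
# over per-query conclusions from four models. Model names and labels below are
# placeholders, not the actual eval schema.
from itertools import combinations

def mean_pairwise_agreement(conclusions_by_model):
    """conclusions_by_model maps model name -> list of per-query conclusions."""
    models = sorted(conclusions_by_model)
    n_queries = len(conclusions_by_model[models[0]])
    scores = []
    for a, b in combinations(models, 2):
        matches = sum(
            x == y for x, y in zip(conclusions_by_model[a], conclusions_by_model[b])
        )
        scores.append(matches / n_queries)
    return sum(scores) / len(scores)

runs = {
    "model_a": ["supports", "refutes", "supports"],
    "model_b": ["supports", "refutes", "mixed"],
    "model_c": ["supports", "refutes", "supports"],
    "model_d": ["supports", "mixed", "supports"],
}
print(round(mean_pairwise_agreement(runs), 2))  # higher = more stable conclusions
```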