Benchmarks
Public benchmarks comparing Relvia against baseline AI research tools across grounding, calibration, and reliability. The methodology is open, and the full suite is rerun on every release.
| Metric | Relvia | Baseline A | Baseline B |
|---|---|---|---|
| Source-grounded accuracy (claims supported by retrievable, attributable sources) | 94.2% | 71.8% | 63.4% |
| Hallucination rate (unverifiable or fabricated claims per 100 outputs) | 0.6% | 5.4% | 8.9% |
| Confidence calibration, ECE (expected calibration error; lower is better) | 0.041 | 0.182 | 0.214 |
| Cross-model agreement (stability of conclusions across 4 leading models) | 0.91 | 0.62 | 0.55 |
| Source diversity (average independent sources per non-trivial claim) | 5.4 | 1.7 | 1.2 |
| Repeatability (conclusion stability across repeated runs within 24 hours) | 0.96 | 0.74 | 0.68 |
Baselines are anonymized as A and B. Methodology, eval set, and raw results are available on request to qualified research partners. Numbers shown are means across three independent runs.
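The calibration row reports expected calibration error (ECE). As a rough illustration of how such a figure is typically computed, the sketch below groups claims by stated confidence and compares average confidence to observed accuracy within each bin; the bin count and equal-width binning are assumptions for illustration, not Relvia's published methodology.

```python
# Hedged sketch: one common way to compute expected calibration error (ECE)
# with equal-width confidence bins. Bin count and binning scheme are
# illustrative assumptions, not Relvia's published methodology.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: stated confidence in [0, 1] for each claim.
    correct: 1 if the claim was verified correct, else 0."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        weight = in_bin.mean()                 # fraction of claims in this bin
        avg_conf = confidences[in_bin].mean()  # mean stated confidence
        accuracy = correct[in_bin].mean()      # observed accuracy
        ece += weight * abs(accuracy - avg_conf)
    return ece

# Toy example over four claims; real eval sets are far larger.
print(expected_calibration_error([0.9, 0.8, 0.95, 0.6], [1, 1, 1, 0]))
```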
Built to be rerun, not curated.
Every benchmark in this table is reproducible. Pipelines are versioned, prompts are pinned, and evaluations are rerun on every Relvia release.
- Independent eval set: 2,840 queries across finance, healthcare, market research, policy, technology, and operations.
- Versioned pipelines: each Relvia release is benchmarked against the previous release on the same eval set.
- Cross-model coverage: outputs are compared across four frontier models to detect provider-specific failure modes; a sketch of one way to score such agreement follows this list.
- Open audit: qualified research partners can request raw outputs, source graphs, and run logs.
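As a simplified illustration of the cross-model agreement and repeatability rows, the sketch below scores agreement between runs as the mean pairwise Jaccard similarity over extracted conclusions. The Jaccard choice and the string-level normalization of conclusions are assumptions made for this example; Relvia's actual agreement metric is not specified here.

```python
# Hedged sketch: a simple pairwise-agreement score over conclusions from
# several model runs. Treating conclusions as normalized strings and using
# Jaccard similarity are illustrative assumptions, not Relvia's metric.
from itertools import combinations

def pairwise_agreement(runs):
    """runs: list of sets of normalized conclusion strings, one per model or run.
    Returns the mean Jaccard similarity over all pairs (1.0 = identical)."""
    scores = []
    for a, b in combinations(runs, 2):
        union = a | b
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores) if scores else 1.0

# Toy example: four runs with partially overlapping conclusions.
runs = [
    {"rate cut likely in q3", "inflation cooling"},
    {"rate cut likely in q3", "inflation cooling"},
    {"rate cut likely in q3", "inflation cooling", "housing supply tight"},
    {"rate cut likely in q3"},
]
print(round(pairwise_agreement(runs), 2))
```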