Alex Jacobs; Dany Ayvazov • SALUS Safety • October 2025
A comprehensive benchmark of 1,023 safety questions
Example benchmark question: When using Corotech Polyamide Epoxy Primer Red CV150, what is the ACGIH recommended Time-Weighted Average (TWA) exposure limit for ethyl benzene?
We present SALUS IQ, which achieves 94.0% accuracy on construction safety queries, 13 to 19 percentage points better than GPT-5, Claude, and Gemini.
Construction safety documentation is critical but underserved by general-purpose AI. ChatGPT, Claude, and Gemini frequently hallucinate plausible but incorrect safety information—a high-stakes failure mode that can lead to injuries or fatalities.
SALUS IQ solves this through a domain-specific AI system that combines hybrid retrieval (dense vectors + BM25 + reranking) with safety-optimized prompting to deliver 100% document-grounded answers.
- First comprehensive construction safety AI benchmark (1,023 questions)
- 13-19 pp improvement over frontier LLMs (p < 0.001)
- Analysis of systematic failure modes in general-purpose LLMs
- SDS, manuals, and regulations follow distinct structures requiring specialized parsing
- Incorrect safety information can result in injuries, fatalities, and legal liability
- Regulations update frequently, requiring version-aware retrieval
- State and federal regulations may conflict with or complement each other

Existing approaches fall short:
- General-purpose LLMs (ChatGPT, Gemini, Claude) hallucinate plausible but incorrect safety information
- Document search returns relevant documents but requires manual extraction of specific answers
- Rule-based systems lack flexibility for natural language queries and require constant maintenance
This paper makes three contributions: the SALUS IQ system, the SALUS-SafetyQA benchmark of 1,023 expert-validated questions, and an analysis of systematic failure modes in frontier LLMs.
SALUS IQ is a production domain-specific AI system designed specifically for construction safety. Unlike general-purpose LLMs, it combines hybrid retrieval, safety-optimized prompting, and 100% document grounding to deliver verifiable answers to safety-critical questions.
- Document processing: type-aware parsing for SDS, regulations, standards, and manuals
- Indexing: dense + sparse vector search with metadata filtering
- Reranking: multi-signal scoring combining semantic similarity, keywords, and learned rankers
- Query understanding: model-based decomposition into canonical search terms
- Verification: checks retrieved spans for query alignment
- Answer generation: safety-optimized prompt enforcing grounding and compliance
- Hybrid retrieval: combines dense vectors (semantic), BM25 (keyword), and learned reranking for robust retrieval across heterogeneous safety documents
- Type-aware parsing: specialized handling for SDS, regulations, equipment manuals, and standards, each requiring a different extraction strategy
- Safety-optimized prompting: custom prompts that enforce source citation, prioritize accuracy over completeness, and flag uncertainty
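The hybrid retrieval idea above can be sketched with reciprocal rank fusion (RRF), a common way to merge dense and BM25 rankings. This is an illustrative sketch only: the fusion constant `k = 60`, the toy document ids, and RRF itself as the fusion method are assumptions, not SALUS IQ's actual configuration.

```python
# Minimal sketch of hybrid retrieval via reciprocal rank fusion (RRF).
# Assumptions: k=60 fusion constant and toy doc ids; the real system's
# fusion method and parameters are not disclosed in the paper.

def rrf_fuse(rankings, k=60):
    """Combine several ranked lists of doc ids into one fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["sds_cv150", "manual_77", "osha_1926"]   # semantic search order
bm25_hits  = ["sds_cv150", "osha_1926", "alert_12"]    # keyword search order
fused = rrf_fuse([dense_hits, bm25_hits])
# Documents ranked highly by both retrievers rise to the top; a learned
# reranker would then rescore this fused candidate set.
```

RRF is attractive here because it needs no score normalization across the semantically and lexically scored lists, only their ranks.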
Plus 5 other sources: Safety Alerts (27), Reports (24), Forms (10), Policies (10), Other (10)
- Questions: worker-phrased safety queries
- Answers: document-grounded responses
- Distractors: plausible incorrect options
- Categories: specification, compliance, hazards, etc.
- Jurisdiction: state/province tagging where relevant
- Review: validated for suitability and accuracy
Plus 5 more types: Emergency (31), What PPE (28), Who Responsible (23), Inspection (15), Incident (6)
```json
{
  "mc_question": "In Michigan, what is the required service interval and periodic test voltage for a fiberglass live-line tool used for primary employee protection?",
  "mc_options": [
    {"label": "a", "text": "Every year, tested at 50,000 volts per foot for 1 minute.", "is_correct": false},
    {"label": "b", "text": "Every 2 years, tested at 100,000 volts per foot for 5 minutes.", "is_correct": false},
    {"label": "c", "text": "Every year, tested at 100,000 volts per foot for 5 minutes.", "is_correct": false},
    {"label": "d", "text": "Every 2 years, tested at 75,000 volts per foot for 1 minute.", "is_correct": true}
  ],
  "mc_correct_answer": "d"
}
```

Evaluation is performed with a dedicated scoring script. We evaluate SALUS IQ against four frontier LLM baselines: GPT-5, Claude 4.1 Opus, Claude 4.5 Sonnet, and Gemini 2.5 Pro (Table 1). All baselines receive identical multiple-choice prompts with zero-shot evaluation.
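The scoring logic reduces to comparing each predicted option label against `mc_correct_answer`. A minimal sketch, assuming records shaped like the JSON example above; `predict` stands in for any system under test (a hypothetical stub here):

```python
# Minimal sketch of the multiple-choice grader. The record fields follow
# the JSON example above; the stub predictor is purely illustrative.

def grade(records, predict):
    """Return accuracy of a predictor mapping a record to an option label."""
    correct = sum(1 for rec in records if predict(rec) == rec["mc_correct_answer"])
    return correct / len(records)

# Toy records (real ones also carry mc_question and mc_options).
records = [
    {"mc_correct_answer": "d"},
    {"mc_correct_answer": "a"},
]
acc = grade(records, lambda rec: "d")  # stub that always answers "d"
# acc == 0.5
```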
| System | Accuracy | 95% CI | Notes |
|---|---|---|---|
| SALUS IQ (2025.10) | 94.0% | [92.57%, 95.41%] | Hybrid RAG |
| GPT-5 (API) | 80.3% | [77.81%, 82.70%] | Zero-shot |
| Claude 4.1 Opus | 80.9% | [78.49%, 83.28%] | Zero-shot |
| Claude 4.5 Sonnet | 75.1% | [72.43%, 77.71%] | Zero-shot |
| Gemini 2.5 Pro | 79.0% | [76.44%, 81.43%] | Zero-shot |
Table 1: Overall benchmark results on SALUS-SafetyQA (n=1,023 questions). SALUS IQ shows 13.78 pp improvement over GPT-5, 13.10 pp over Claude 4.1 Opus, 18.96 pp over Claude 4.5 Sonnet, and 15.05 pp over Gemini 2.5 Pro. All comparisons are statistically significant (p < 0.001, McNemar's test).
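The confidence intervals in Table 1 can be reproduced approximately with a Wilson score interval. This is an assumption: the paper does not state which interval method was used, and a bootstrap or normal-approximation interval would give similar but not identical bounds.

```python
# Sketch of a 95% Wilson score interval for a binomial proportion,
# applied to the Table 1 accuracy figures (interval method assumed).
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for successes out of n trials."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# SALUS IQ: 94.0% of 1,023 questions, i.e. roughly 962 correct.
lo, hi = wilson_ci(round(0.940 * 1023), 1023)
```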
All pairwise comparisons between SALUS IQ and baseline models show highly significant differences (McNemar's test, p < 0.001):
| Comparison | SALUS Wins | Baseline Wins | Net Advantage | p-value |
|---|---|---|---|---|
| vs GPT-5 | 171 | 30 | +141 | p < 0.001 |
| vs Claude 4.1 Opus | 163 | 29 | +134 | p < 0.001 |
| vs Claude 4.5 Sonnet | 225 | 31 | +194 | p < 0.001 |
| vs Gemini 2.5 Pro | 188 | 34 | +154 | p < 0.001 |
Table 2: McNemar's test results for pairwise comparisons. On questions where the systems disagreed, SALUS IQ answered correctly in 84.7%–87.9% of cases (mean: 85.6%), demonstrating consistent superiority across all comparisons.
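McNemar's test depends only on the discordant pairs in Table 2 (questions one system got right and the other wrong). A sketch using the exact binomial form; whether the paper used the exact or the chi-square variant is not stated.

```python
# Exact McNemar's test on discordant counts b ("SALUS wins") and
# c ("baseline wins"); the exact-vs-chi-square choice is an assumption.
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact p-value from discordant counts b and c."""
    n, k = b + c, min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

p = mcnemar_exact(171, 30)  # SALUS vs GPT-5 discordant pairs from Table 2
# p is far below 0.001, consistent with Table 2
```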
Key insight: SALUS IQ achieves 90%+ accuracy on all question types, with consistent performance across categories (including 92.0% on specifications).
- LLMs hallucinate plausible but incorrect numerical values, the most frequent failure mode
- Procedural knowledge shows a 32 pp performance gap, the biggest weakness
- Calibrated confidence enables human-in-the-loop review: predictions with confidence below 0.85 are flagged for expert review
- Baselines show narrow confidence gaps between right and wrong answers, i.e. they are overconfident when wrong and thus unsuitable for safety-critical applications
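The human-in-the-loop gate described above can be sketched in a few lines. The 0.85 threshold comes from the text; the record fields and function names are illustrative assumptions.

```python
# Sketch of the human-in-the-loop gate: answers below the confidence
# threshold are routed to an expert. Field names are assumptions.

REVIEW_THRESHOLD = 0.85  # threshold stated in the text

def triage(predictions):
    """Split predictions into auto-served answers and an expert-review queue."""
    auto, review = [], []
    for pred in predictions:
        (auto if pred["confidence"] >= REVIEW_THRESHOLD else review).append(pred)
    return auto, review

preds = [
    {"id": 1, "confidence": 0.97},
    {"id": 2, "confidence": 0.62},
]
auto, review = triage(preds)  # one auto-served answer, one flagged for review
```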
Note: Questions are designed to be answerable with publicly available safety documentation (SDS, OSHA regulations, equipment manuals) to enable independent evaluation.
The SALUS-SafetyQA benchmark is released under the CC-BY-4.0 license.
We present SALUS IQ, a domain-specialized construction safety AI system, and introduce the SALUS-SafetyQA Benchmark containing 1,023 expert-validated questions. Results show that tailored RAG pipelines significantly outperform general-purpose LLMs in high-stakes safety domains, with 13-19 percentage point improvements and 100% document grounding.
The benchmark reveals systematic failure modes in frontier LLMs: hallucination of plausible but incorrect specifications, poor performance on equipment manuals, and miscalibrated overconfidence. These findings demonstrate the critical need for domain-specific systems in safety-critical applications.
Future work includes expanding jurisdictional coverage, incorporating multimodal inputs (drawings, diagrams), and developing open-ended evaluation protocols.
Get the full paper, benchmark dataset, and evaluation code.
```bibtex
@techreport{salus2025safetyqa,
  title       = {SALUS IQ: Technical White Paper \& Benchmark Report},
  author      = {Alex Jacobs and Dany Ayvazov},
  institution = {SALUS Safety},
  year        = {2025},
  month       = oct,
  note        = {Version 1.0}
}
```