SALUS IQ: Technical White Paper & Benchmark Report
Alex Jacobs; Dany Ayvazov • SALUS Safety • October 2025
Performance on SALUS-SafetyQA
A comprehensive benchmark of 1,023 safety questions
Sample Benchmark Question
When using Corotech Polyamide Epoxy Primer Red CV150, what is the ACGIH recommended Time-Weighted Average (TWA) exposure limit for ethyl benzene?
Abstract
We present SALUS IQ, a domain-specific construction safety AI system that achieves 94.0% accuracy on construction safety queries, 13-19 percentage points higher than GPT-5, Claude, and Gemini.
The Problem
Construction safety documentation is critical but underserved by general-purpose AI. ChatGPT, Claude, and Gemini frequently hallucinate plausible but incorrect safety information—a high-stakes failure mode that can lead to injuries or fatalities.
SALUS IQ solves this through a domain-specific AI system that combines hybrid retrieval (dense vectors + BM25 + reranking) with safety-optimized prompting to deliver 100% document-grounded answers.
Our Contributions
- SALUS-SafetyQA, the first comprehensive safety AI benchmark (1,023 questions)
- 13-19 percentage point improvement over frontier LLMs (p < 0.001)
- Identification of systematic failure modes in general-purpose LLMs
1. Introduction
Why Construction Safety AI is Hard
- SDS, manuals, and regulations follow distinct structures requiring specialized parsing
- Incorrect safety information can result in injuries, fatalities, and legal liability
- Regulations update frequently, requiring version-aware retrieval
- State and federal regulations may conflict with or complement each other
Why Existing Solutions Fail
- General-purpose LLMs (ChatGPT, Gemini, Claude) hallucinate plausible but incorrect safety information
- Document search tools return relevant documents but require manual extraction of specific answers
- Rule-based systems lack the flexibility for natural language queries and require constant maintenance
1.2 Contributions
This paper makes the following contributions:
- SALUS-SafetyQA benchmark - The first comprehensive evaluation benchmark for construction safety AI, containing 1,023 expert-validated multiple-choice questions across 11 question types and 10 document source types
- Comprehensive comparative evaluation - Systematic evaluation of domain-specific RAG (SALUS IQ) versus frontier LLMs (GPT-5, Claude 4.1/4.5, Gemini 2.5 Pro) with rigorous statistical analysis demonstrating 13-19 percentage point improvements
- Analysis of domain adaptation gaps - Identification of systematic failure modes in general-purpose LLMs, including hallucinated specifications, poor performance on equipment manuals, and overconfident, poorly calibrated predictions
3. The SALUS IQ System
SALUS IQ is a production AI system built specifically for construction safety. Unlike general-purpose LLMs, it combines hybrid retrieval, safety-optimized prompting, and 100% document grounding to deliver verifiable answers to safety-critical questions.
Architecture: 6-Layer Retrieval Pipeline
1. Document parsing: type-aware parsing for SDS, regulations, standards, and manuals
2. Hybrid search: dense + sparse vector search with metadata filtering
3. Multi-signal ranking: combines semantic similarity, keyword match, and learned rerankers
4. Query decomposition: model-based decomposition of queries into canonical search terms
5. Span verification: checks retrieved spans for alignment with the query
6. Grounded generation: safety-optimized prompt enforcing grounding and compliance
Key Innovations
- Hybrid retrieval: combines dense vectors (semantic), BM25 (keyword), and learned reranking for robust retrieval across heterogeneous safety documents (a minimal sketch follows this list)
- Type-aware parsing: specialized extraction for SDS, regulations, equipment manuals, and standards, each requiring a different strategy
- Safety-optimized prompting: custom prompts that enforce source citation, prioritize accuracy over completeness, and flag uncertainty
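As a rough illustration of the retrieval fusion, the sketch below combines min-max-normalized dense and BM25 scores. The `embed` stub, the fusion weight `alpha`, and the absence of the learned reranker are all simplifications; this shows the general pattern, not the SALUS IQ implementation.

```python
# Minimal sketch of hybrid (dense + BM25) score fusion.
# embed() is a hypothetical stand-in for a sentence-embedding model.
import hashlib
import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def embed(text: str) -> np.ndarray:
    """Deterministic placeholder encoder; a real system uses a trained model."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).normal(size=384)
    return v / np.linalg.norm(v)

def hybrid_search(query: str, docs: list[str], alpha: float = 0.6, top_k: int = 5):
    # Dense signal: cosine similarity between unit-norm embeddings.
    q = embed(query)
    dense = np.array([q @ embed(d) for d in docs])
    # Sparse signal: BM25 over whitespace-tokenized text.
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    sparse = np.array(bm25.get_scores(query.lower().split()))
    # Min-max normalize each signal so the weighted sum is comparable.
    norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-9)
    fused = alpha * norm(dense) + (1 - alpha) * norm(sparse)
    top = np.argsort(-fused)[:top_k]
    return [(docs[i], float(fused[i])) for i in top]
```

In a production pipeline a reranker would rescore these top candidates before generation.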
4. Dataset & Benchmark Design
Benchmark Question Distribution by Source
(Chart omitted: the five largest source types.) The remaining five sources are Safety Alerts (27), Reports (24), Forms (10), Policies (10), and Other (10).
Question Generation Pipeline
1. Question drafting: worker-phrased safety queries
2. Answer grounding: document-grounded correct response
3. Distractor generation: plausible but incorrect options
4. Type classification: specification, compliance, hazards, etc.
5. Jurisdiction tagging: state/province tagging where relevant
6. Expert validation: each question validated for suitability and accuracy
Question Type Distribution
(Chart omitted: the top question categories.) Five additional types: Emergency (31), What PPE (28), Who Responsible (23), Inspection (15), Incident (6).
Example Question
```json
{
"mc_question": "In Michigan, what is the required service interval and periodic test voltage for a fiberglass live-line tool used for primary employee protection?",
"mc_options": [
{
"label": "a",
"text": "Every year, tested at 50,000 volts per foot for 1 minute.",
"is_correct": false
},
{
"label": "b",
"text": "Every 2 years, tested at 100,000 volts per foot for 5 minutes.",
"is_correct": false
},
{
"label": "c",
"text": "Every year, tested at 100,000 volts per foot for 5 minutes.",
"is_correct": false
},
{
"label": "d",
"text": "Every 2 years, tested at 75,000 volts per foot for 1 minute.",
"is_correct": true
}
],
"mc_correct_answer": "d"
}
```
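A minimal structural check for records of this form (field names taken from the example above; the checks are illustrative, not the validation used in the actual pipeline):

```python
# Structural sanity check for a SALUS-SafetyQA question record.
def validate_question(q: dict) -> None:
    assert q["mc_question"].strip(), "question text must be non-empty"
    labels = [opt["label"] for opt in q["mc_options"]]
    assert len(labels) == len(set(labels)), "option labels must be unique"
    correct = [opt["label"] for opt in q["mc_options"] if opt["is_correct"]]
    assert len(correct) == 1, "exactly one option must be marked correct"
    assert correct[0] == q["mc_correct_answer"], "answer key must match the flag"
```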
5. Evaluation Methodology
5.1 Benchmark Harness
Evaluation is performed with a dedicated script that:
- Loads questions from JSON
- Calls the SALUS IQ benchmark endpoint or model provider endpoint
- Records selected answer, correctness, reasoning, retrieval stats
- Computes metrics: accuracy, average confidence, retrieval usage
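A minimal sketch of the harness loop is given below; `ask_model` is a hypothetical stand-in for the call to the SALUS IQ endpoint or a provider API, and field names follow the example question record above.

```python
# Sketch of the benchmark harness loop. ask_model() is a hypothetical
# stand-in for a call to the system under test.
import json

def ask_model(question: str, options: list[str]) -> str:
    """Return a label ('a'-'d'); wire this to the endpoint under test."""
    raise NotImplementedError

def evaluate(path: str) -> float:
    with open(path) as f:
        questions = json.load(f)
    correct = 0
    for q in questions:
        options = [f"{opt['label']}) {opt['text']}" for opt in q["mc_options"]]
        pred = ask_model(q["mc_question"], options)
        correct += int(pred == q["mc_correct_answer"])
    accuracy = correct / len(questions)
    print(f"Accuracy: {accuracy:.1%} ({correct}/{len(questions)})")
    return accuracy
```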
5.2 Comparative Baselines
We evaluate:
- SALUS IQ (domain expert LLM system)
- GPT-5, via API (OpenAI)
- Claude 4.1 Opus (Anthropic)
- Claude 4.5 Sonnet (Anthropic)
- Gemini 2.5 Pro (Google DeepMind)
All baselines receive identical multiple-choice prompts with zero-shot evaluation.
5.3 Metrics
- Accuracy: Percentage of correct answers
- Confidence calibration: Mean confidence split by correctness
- Retrieval statistics: Average docs retrieved, percentage using context
- Statistical significance: Bootstrap confidence intervals and McNemar's test (α=0.05)
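The two procedures can be sketched as follows, assuming `a` and `b` are boolean per-question correctness arrays for two systems on the same questions (illustrative code, not the released analysis scripts):

```python
# Sketch of the statistical procedures: percentile-bootstrap CI for accuracy
# and McNemar's test on paired correctness.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def bootstrap_ci(correct: np.ndarray, n_boot: int = 10_000, alpha: float = 0.05):
    """Percentile bootstrap confidence interval for mean accuracy."""
    rng = np.random.default_rng(0)
    n = len(correct)
    boots = np.array([correct[rng.integers(0, n, n)].mean() for _ in range(n_boot)])
    return np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])

def mcnemar_p(a: np.ndarray, b: np.ndarray) -> float:
    """McNemar's test on the 2x2 paired-outcome table; the off-diagonal
    (disagreement) cells drive the statistic."""
    table = [[int(np.sum(a & b)), int(np.sum(a & ~b))],
             [int(np.sum(~a & b)), int(np.sum(~a & ~b))]]
    return mcnemar(table, exact=False, correction=True).pvalue
```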
6. Results
6.1 Overall Performance
System | Accuracy | 95% CI | Notes |
---|---|---|---|
SALUS IQ (2025.10) | 94.0% | [92.57%, 95.41%] | Hybrid RAG |
GPT-5 (API) | 80.3% | [77.81%, 82.70%] | Zero-shot |
Claude 4.1 Opus | 80.9% | [78.49%, 83.28%] | Zero-shot |
Claude 4.5 Sonnet | 75.1% | [72.43%, 77.71%] | Zero-shot |
Gemini 2.5 Pro | 79.0% | [76.44%, 81.43%] | Zero-shot |
Table 1: Overall benchmark results on SALUS-SafetyQA (n=1,023 questions). SALUS IQ shows 13.78 pp improvement over GPT-5, 13.10 pp over Claude 4.1 Opus, 18.96 pp over Claude 4.5 Sonnet, and 15.05 pp over Gemini 2.5 Pro. All comparisons are statistically significant (p < 0.001, McNemar's test).
6.2 Performance Visualizations
(Figures omitted: performance visualizations; the underlying numbers appear in Tables 1 and 2.)
6.3 Statistical Significance
All pairwise comparisons between SALUS IQ and baseline models show highly significant differences (McNemar's test, p < 0.001):
Comparison | SALUS Wins | Baseline Wins | Net Advantage | p-value |
---|---|---|---|---|
vs GPT-5 | 171 | 30 | +141 | p < 0.001 |
vs Claude 4.1 Opus | 163 | 29 | +134 | p < 0.001 |
vs Claude 4.5 Sonnet | 225 | 31 | +194 | p < 0.001 |
vs Gemini 2.5 Pro | 188 | 34 | +154 | p < 0.001 |
Table 2: McNemar's test results for pairwise comparisons. On questions where the paired systems disagreed, SALUS IQ answered correctly in 84.7%-87.9% of cases (mean: 85.6%), demonstrating consistent superiority across all comparisons.
7. Analysis
Performance Improvements
Key insight: SALUS IQ achieves 90%+ accuracy on all question types, with consistent performance across categories (including 92.0% on specifications).
Biggest Performance Gaps
- Specification questions (31.7% of the benchmark): baseline LLMs hallucinate plausible but incorrect numerical values, the most frequent failure mode.
- Equipment manual questions: procedural knowledge shows a 32 pp performance gap, the baselines' largest weakness.
Error Analysis
- SALUS IQ errors: only 5.96% (61 of 1,023 questions)
- Baseline LLM errors: 21.7% on average
Confidence Calibration
- SALUS IQ: well calibrated. Confidence separates correct from incorrect answers, enabling human-in-the-loop review: predictions with confidence below 0.85 are flagged for expert review.
- Baseline models: poorly calibrated. Narrow confidence gaps between correct and incorrect answers mean the models are overconfident when wrong, making them unsuitable for safety-critical applications.
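A sketch of the calibration-gap measurement and the review gate described above (the 0.85 threshold comes from this report; the code itself is illustrative):

```python
# Sketch of the calibration gap and the human-in-the-loop review gate.
import numpy as np

def calibration_gap(confidences, correct):
    """Mean confidence on correct answers minus mean on incorrect ones.
    A large gap means confidence is informative; a near-zero gap means the
    model is about as confident when wrong as when right."""
    conf = np.asarray(confidences, dtype=float)
    ok = np.asarray(correct, dtype=bool)
    return conf[ok].mean() - conf[~ok].mean()

def flag_for_review(confidences, threshold=0.85):
    """Indices of predictions to route to an expert, per the <0.85 rule."""
    return [i for i, c in enumerate(confidences) if c < threshold]
```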
8. Transparency & Data Availability
8.1 What We Publish
- Full benchmark dataset (1,023 questions) with question text, options, correct answers, and metadata
- Complete evaluation harness (evaluate_benchmark.py) and configuration files
- Statistical analysis code with bootstrap confidence intervals and McNemar's tests
8.2 What We Do Not Publish
- Source document content or page snippets (proprietary corpus)
- Document titles and internal database identifiers
- Raw document files or training data
Note: Questions are designed to be answerable with publicly available safety documentation (SDS, OSHA regulations, equipment manuals) to enable independent evaluation.
8.3 Data Availability
The SALUS-SafetyQA benchmark is released under CC-BY-4.0 license:
- Dataset & Code: github.com/Salus-Technologies/SALUS-SafetyQA
- License: Creative Commons Attribution 4.0 International
8.4 Limitations
- English-heavy corpus: Incomplete coverage for non-US regulations
- Domain shift risk: New standards and equipment revisions may not be represented
- Multiple-choice format: May overestimate performance relative to open-ended extraction
- Temporal constraints: Safety regulations and standards change over time
9. Conclusion
We present SALUS IQ, a domain-specialized construction safety AI system, and introduce the SALUS-SafetyQA Benchmark containing 1,023 expert-validated questions. Results show that tailored RAG pipelines significantly outperform general-purpose LLMs in high-stakes safety domains, with 13-19 percentage point improvements and 100% document grounding.
The benchmark reveals systematic failure modes in frontier LLMs: hallucination of plausible but incorrect specifications, poor performance on equipment manuals, and overconfident, poorly calibrated predictions. These findings demonstrate the critical need for domain-specific systems in safety-critical applications.
Future work includes expanding jurisdictional coverage, incorporating multimodal inputs (drawings, diagrams), and developing open-ended evaluation protocols.
Citation
```bibtex
@techreport{salus2025safetyqa,
  title       = {SALUS IQ: Technical White Paper \& Benchmark Report},
  author      = {Alex Jacobs and Dany Ayvazov},
  institution = {SALUS Safety},
  year        = {2025},
  month       = {October},
  note        = {Version 1.0}
}
```