
SALUS IQ: Technical White Paper & Benchmark Report

Alex Jacobs; Dany Ayvazov • SALUS Safety • October 2025

Performance on SALUS-SafetyQA

A comprehensive benchmark of 1,023 safety questions

  • SALUS IQ (2025.10): 94.0%
  • GPT‑5 (API): 80.3% (13.78 pp behind)
  • Claude 4.1 Opus: 80.9% (13.10 pp behind)
  • Claude 4.5 Sonnet: 75.1% (18.96 pp behind)
  • Gemini 2.5 Pro: 79.0% (15.05 pp behind)

Contents

  • Abstract
  • 1. Introduction
  • 2. Related Work
  • 3. System Overview
  • 4. Dataset & Benchmark
  • 5. Methodology
  • 6. Results
  • 7. Analysis
  • 8. Transparency
  • 9. Conclusion
  • References
Sample Benchmark Question

When using Corotech Polyamide Epoxy Primer Red CV150, what is the ACGIH recommended Time-Weighted Average (TWA) exposure limit for ethyl benzene?

Abstract

We present SALUS IQ, a domain-specific construction safety AI system that achieves 94.0% accuracy on the SALUS-SafetyQA benchmark, 13-19 percentage points higher than GPT-5, Claude, and Gemini.

The Problem

Construction safety documentation is critical but underserved by general-purpose AI. ChatGPT, Claude, and Gemini frequently hallucinate plausible but incorrect safety information—a high-stakes failure mode that can lead to injuries or fatalities.

SALUS IQ addresses this with a domain-specific pipeline that combines hybrid retrieval (dense vectors + BM25 + reranking) with safety-optimized prompting to deliver 100% document-grounded answers.

Our Contributions

  • SALUS-SafetyQA: the first comprehensive safety AI benchmark (1,023 questions)
  • 13-19 pp improvement over frontier LLMs (p < 0.001)
  • Failure analysis: systematic failure modes identified in general-purpose LLMs

Keywords: RAG, Construction Safety, Domain Adaptation, Hybrid Search, Benchmark

1. Introduction

Why Construction Safety AI is Hard

  • Heterogeneous formats: SDS, manuals, and regulations follow distinct structures requiring specialized parsing
  • High-stakes accuracy: incorrect safety information can result in injuries, fatalities, and legal liability
  • Temporal sensitivity: regulations update frequently, requiring version-aware retrieval
  • Multi-jurisdictional complexity: state and federal regulations may conflict with or complement each other

Why Existing Solutions Fail

  • General-purpose LLMs: ChatGPT, Gemini, and Claude hallucinate plausible but incorrect safety information
  • Traditional search: returns relevant documents but requires manual extraction of specific answers
  • Rule-based systems: lack flexibility for natural-language queries and require constant maintenance

1.2 Contributions

This paper makes the following contributions:

  1. SALUS-SafetyQA benchmark - The first comprehensive evaluation benchmark for construction safety AI, containing 1,023 expert-validated multiple-choice questions across 11 question types and 10 document source types
  2. Comprehensive comparative evaluation - Systematic evaluation of domain-specific RAG (SALUS IQ) versus frontier LLMs (GPT-5, Claude 4.1/4.5, Gemini 2.5 Pro) with rigorous statistical analysis demonstrating 13-19 percentage point improvements
  3. Analysis of domain adaptation gaps - Identification of systematic failure modes in general-purpose LLMs including hallucination of specifications, poor performance on equipment manuals, and poor confidence calibration

2. Related Work

2.1 Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) systems enhance large language models (LLMs) with external knowledge retrieval [1, 2]. Recent advances include:

  • Dense retrieval: Dense Passage Retrieval (DPR) [3] and ColBERT [4]
  • Hybrid approaches: Techniques that combine dense and sparse signals [5, 6]
  • Reranking: Cross-encoders and other learned rankers [7]

SALUS IQ builds on these foundations with safety-specific adaptations for construction contexts.

2.2 Domain-Specific AI Systems

Specialized AI systems have demonstrated success in high-stakes domains:

  • Medical: Med-PaLM 2 achieved expert-level medical question answering performance [8]
  • Legal: The Reasoning-Focused Legal Retrieval Benchmark introduced domain-specific retrieval challenges [9]
  • Scientific: Galactica demonstrated strong performance on scientific reasoning tasks [10]

However, construction safety presents unique challenges not directly addressed by these systems.

2.3 Safety AI Benchmarks

Research in safety AI for construction is limited compared to other fields. Prior efforts include:

  • Accident cause classification using Word2Vec and deep learning [11]
  • Incident report analysis with machine learning [12]
  • Personal protective equipment (PPE) detection in images [13]

To our knowledge, SALUS-SafetyQA is the first comprehensive QA benchmark designed specifically for construction safety.

3. The SALUS IQ System

SALUS IQ is a production AI system built specifically for construction safety. Unlike general-purpose LLMs, it combines hybrid retrieval, safety-optimized prompting, and 100% document grounding to deliver verifiable answers to safety-critical questions.

Architecture: 6-Layer Retrieval Pipeline

  1. Document processing - type-aware parsing for SDS, regulations, standards, and manuals
  2. Hybrid search - dense + sparse vector search with metadata filtering
  3. Hybrid reranking - multi-signal combination of semantic similarity, keyword matches, and learned rankers
  4. Query expansion - model-based decomposition into canonical search terms
  5. Validation layer - checks retrieved spans for query alignment
  6. Answer synthesis - safety-optimized prompt enforcing grounding and compliance
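
The production retrieval implementation is not published; as a rough illustration of how the dense and sparse signals in layers 2-3 can be fused before reranking, the sketch below combines BM25 and dense-vector rankings with reciprocal rank fusion. The functions dense_search, bm25_search, and rerank are hypothetical placeholders, not the SALUS IQ API.

from collections import defaultdict

# Minimal sketch: fuse dense (semantic) and BM25 (keyword) rankings with
# reciprocal rank fusion, then apply a learned reranker to the fused list.
def rrf_fuse(rankings, k=60):
    """Combine several ranked lists of document IDs into one fused ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # standard RRF weighting
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_retrieve(query, dense_search, bm25_search, rerank, top_k=10):
    dense_ids = dense_search(query, top_k=50)    # semantic candidates
    sparse_ids = bm25_search(query, top_k=50)    # keyword candidates
    candidates = rrf_fuse([dense_ids, sparse_ids])[:50]
    return rerank(query, candidates)[:top_k]     # cross-encoder-style rerank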

Key Innovations

  • Hybrid retrieval: combines dense vectors (semantic), BM25 (keyword), and learned reranking for robust retrieval across heterogeneous safety documents
  • Document-type awareness: specialized parsing for SDS, regulations, equipment manuals, and standards, each requiring a different extraction strategy
  • Safety-optimized prompting: custom prompts that enforce citation of sources, prioritize accuracy over completeness, and flag uncertainty
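
The production prompts themselves are proprietary; the template below is only a hedged sketch of the kind of instructions such a safety-optimized synthesis prompt might encode (mandatory citations, accuracy over completeness, explicit uncertainty). It is illustrative, not the actual SALUS IQ prompt.

# Illustrative prompt template reflecting the constraints described above;
# not the actual SALUS IQ prompt.
SAFETY_ANSWER_PROMPT = """You are a construction safety assistant.
Answer ONLY from the retrieved document excerpts below.
- Cite the source document for every factual claim.
- If the excerpts do not contain the answer, say so explicitly; do not guess.
- Prefer a shorter, fully supported answer over a complete but unsupported one.

Excerpts:
{context}

Question: {question}
"""

def build_prompt(context: str, question: str) -> str:
    return SAFETY_ANSWER_PROMPT.format(context=context, question=question)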

Production Performance

  • Search latency (p95): under 2 seconds
  • Benchmark accuracy: 94.0%
  • Documents indexed: millions
  • Citation grounded: 100%

4. Dataset & Benchmark Design

Benchmark Question Distribution by Source

Source | Questions | Share | Description
SDS | 297 | 29.0% | Chemical hazards & PPE
STANDARD | 218 | 21.3% | ANSI/OSHA standards
REGULATION | 211 | 20.6% | State/federal regulations
MANUAL | 139 | 13.6% | Equipment O&M
TRAINING | 77 | 7.5% | Instructional content

Plus 5 other sources: Safety Alerts (27), Reports (24), Forms (10), Policies (10), Other (10)

Question Generation Pipeline

  • Natural-language question: worker-phrased safety queries
  • Expected answer: document-grounded response
  • Multiple choice (4 options): plausible distractors
  • 11 question categories: specification, compliance, hazards, etc.
  • Jurisdictional metadata: state/province tagging where relevant
  • LLM-based generation: validated for suitability and accuracy

Question Type Distribution

1,023 total questions spanning 11 types across 10 document sources.

Top categories:

Question type | Questions | Share
Specification | 324 | 31.7%
Compliance | 230 | 22.5%
What Hazards | 160 | 15.6%
How To | 95 | 9.3%
When Required | 58 | 5.7%
Definition | 53 | 5.2%

Plus 5 more types: Emergency (31), What PPE (28), Who Responsible (23), Inspection (15), Incident (6)

Example Question

{
  "mc_question": "In Michigan, what is the required service interval and periodic test voltage for a fiberglass live-line tool used for primary employee protection?",
  "mc_options": [
    {
      "label": "a",
      "text": "Every year, tested at 50,000 volts per foot for 1 minute.",
      "is_correct": false
    },
    {
      "label": "b",
      "text": "Every 2 years, tested at 100,000 volts per foot for 5 minutes.",
      "is_correct": false
    },
    {
      "label": "c",
      "text": "Every year, tested at 100,000 volts per foot for 5 minutes.",
      "is_correct": false
    },
    {
      "label": "d",
      "text": "Every 2 years, tested at 75,000 volts per foot for 1 minute.",
      "is_correct": true
    }
  ],
  "mc_correct_answer": "d"
}
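
For readers working with the released JSON, the helper below checks one record against the schema shown above (four options, exactly one marked correct, and mc_correct_answer pointing at it). The field names come from the example; the function itself is illustrative and not part of the published harness.

# Sanity-check a SALUS-SafetyQA record against the schema shown above.
def validate_record(record: dict) -> None:
    options = record["mc_options"]
    correct_labels = [o["label"] for o in options if o["is_correct"]]
    assert len(options) == 4, "expected four answer options"
    assert len(correct_labels) == 1, "expected exactly one correct option"
    assert correct_labels[0] == record["mc_correct_answer"], "answer key mismatch"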

5. Evaluation Methodology

5.1 Benchmark Harness

Evaluation is performed with a dedicated script that:

  1. Loads questions from JSON
  2. Calls the SALUS IQ benchmark endpoint or model provider endpoint
  3. Records selected answer, correctness, reasoning, retrieval stats
  4. Computes metrics: accuracy, average confidence, retrieval usage
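
The released harness is evaluate_benchmark.py; the loop below is a simplified sketch of the same steps, with answer_question standing in for whichever endpoint (SALUS IQ or a baseline model) is under evaluation. It approximates the published script rather than reproducing it.

import json

def run_benchmark(questions_path, answer_question):
    """answer_question(question) -> dict with 'choice' and optional 'confidence'."""
    with open(questions_path) as f:
        questions = json.load(f)

    records = []
    for q in questions:
        result = answer_question(q)                          # call the endpoint
        is_correct = result["choice"] == q["mc_correct_answer"]
        records.append({"correct": is_correct,
                        "confidence": result.get("confidence")})

    accuracy = sum(r["correct"] for r in records) / len(records)
    return accuracy, records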

5.2 Comparative Baselines

We evaluate:

  • SALUS IQ (domain expert LLM system)
  • GPT‑5 (API) (OpenAI)
  • Claude 4.1 Opus (Anthropic)
  • Claude 4.5 Sonnet (Anthropic)
  • Gemini 2.5 Pro (Google DeepMind)

All baselines receive identical multiple-choice prompts with zero-shot evaluation.

5.3 Metrics

  • Accuracy: Percentage of correct answers
  • Confidence calibration: Mean confidence split by correctness
  • Retrieval statistics: Average docs retrieved, percentage using context
  • Statistical significance: Bootstrap confidence intervals and McNemar's test (α=0.05)
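
Both tests are standard; a compact sketch, assuming 0/1 correctness vectors for two systems on the same questions, is shown below. It mirrors the released analysis code only in spirit and is not a copy of it.

import numpy as np
from scipy.stats import binomtest

def bootstrap_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct)
    boots = [rng.choice(correct, size=len(correct), replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return correct.mean(), (lo, hi)

def mcnemar_exact(a_correct, b_correct):
    """Exact McNemar test on the pairs where the two systems disagree."""
    a, b = np.asarray(a_correct), np.asarray(b_correct)
    a_only = int(((a == 1) & (b == 0)).sum())   # A right, B wrong
    b_only = int(((a == 0) & (b == 1)).sum())   # A wrong, B right
    return binomtest(a_only, a_only + b_only, p=0.5).pvalue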

6. Results

6.1 Overall Performance

System | Accuracy | 95% CI | Notes
SALUS IQ (2025.10) | 94.0% | [92.57%, 95.41%] | Hybrid RAG
GPT‑5 (API) | 80.3% | [77.81%, 82.70%] | Zero-shot
Claude 4.1 Opus | 80.9% | [78.49%, 83.28%] | Zero-shot
Claude 4.5 Sonnet | 75.1% | [72.43%, 77.71%] | Zero-shot
Gemini 2.5 Pro | 79.0% | [76.44%, 81.43%] | Zero-shot

Table 1: Overall benchmark results on SALUS-SafetyQA (n=1,023 questions). SALUS IQ shows 13.78 pp improvement over GPT-5, 13.10 pp over Claude 4.1 Opus, 18.96 pp over Claude 4.5 Sonnet, and 15.05 pp over Gemini 2.5 Pro. All comparisons are statistically significant (p < 0.001, McNemar's test).

6.2 Performance Visualizations

Figure 1: Overall model accuracy comparison across all 1,023 questions.
Figure 2: Model accuracy breakdown by question type. SALUS IQ shows consistent performance across all categories, while baseline models struggle particularly with specification questions (71.0% for GPT-5, 68.8% for Claude 4.1).
Figure 3: Model accuracy breakdown by source document type. Baseline models show significant performance degradation on MANUAL documents (61.9% for GPT-5, 69.1% for Claude 4.1) compared to SALUS IQ (94.2%).
Figure 4: Confidence analysis by correctness. SALUS IQ demonstrates superior calibration with a 0.168 gap between correct and incorrect answers.

6.3 Statistical Significance

All pairwise comparisons between SALUS IQ and baseline models show highly significant differences (McNemar's test, p < 0.001):

Comparison | SALUS Wins | Baseline Wins | Net Advantage | p-value
vs GPT-5 | 171 | 30 | +141 | p < 0.001
vs Claude 4.1 Opus | 163 | 29 | +134 | p < 0.001
vs Claude 4.5 Sonnet | 225 | 31 | +194 | p < 0.001
vs Gemini 2.5 Pro | 188 | 34 | +154 | p < 0.001

Table 2: McNemar's test results showing pairwise comparisons. On questions where models disagreed, SALUS IQ answered correctly in 84.9%–87.9% of cases (mean: 85.6%), demonstrating consistent superiority across all comparisons.

7. Analysis

Performance Improvements

  • +13.78 pp vs GPT-5 (80.3%)
  • +13.10 pp vs Claude 4.1 Opus (80.9%)
  • +18.96 pp vs Claude 4.5 Sonnet (75.1%)
  • +15.05 pp vs Gemini 2.5 Pro (79.0%)

Key insight: SALUS IQ achieves 90%+ accuracy on all question types, with consistent performance across categories (including 92.0% on specifications).

Biggest Performance Gaps

Specification Questions (31.7% of benchmark)

LLMs hallucinate plausible but incorrect numerical values, the most common failure mode.

  • SALUS IQ: 92.0%
  • GPT-5: 71.0%
  • Claude 4.1: 68.8%
  • Claude 4.5: 64.2%

Equipment Manual Questions (largest gap)

Procedural knowledge shows up to a 32 pp performance gap, the baselines' biggest weakness.

  • SALUS IQ: 94.2%
  • GPT-5: 61.9% (-32.3 pp)
  • Claude 4.1: 69.1% (-25.1 pp)

Error Analysis

SALUS IQ errors: 5.96% of questions (61 total)

  • Retrieval failures: 65.6%
  • Ambiguous specifications: 19.7%
  • Edge cases: 14.7%

Baseline LLM errors: 21.7% of questions on average

  • Hallucinated values: 43%
  • General-knowledge substitution: 31%
  • Overconfidence: 26%

Confidence Calibration

SALUS IQ: Well-Calibrated

  • Mean confidence when correct: 0.985
  • Mean confidence when incorrect: 0.817
  • Gap: 0.168 (best of all systems evaluated)

This gap enables a human-in-the-loop workflow: predictions with confidence below 0.85 can be flagged for expert review.

Baseline Models: Poor Calibration

  • Claude 4.5 Sonnet: 0.061 gap (worst)
  • Gemini 2.5 Pro: 0.136 gap

Narrow gaps mean these models remain confident even when wrong, making them unsuitable for safety-critical applications.
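
As a concrete illustration of that review policy, a minimal triage helper might look like the sketch below; the 0.85 threshold is the figure quoted above, while the function and field names are hypothetical.

# Illustrative triage rule using the 0.85 review threshold discussed above.
REVIEW_THRESHOLD = 0.85

def triage(answer: dict) -> str:
    """Route an answer to auto-delivery or expert review based on confidence."""
    if answer.get("confidence", 0.0) < REVIEW_THRESHOLD:
        return "expert_review"
    return "auto_deliver"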

8. Transparency & Data Availability

8.1 What We Publish

  • Full benchmark dataset (1,023 questions) with question text, options, correct answers, and metadata
  • Complete evaluation harness (evaluate_benchmark.py) and configuration files
  • Statistical analysis code with bootstrap confidence intervals and McNemar's tests

8.2 What We Do Not Publish

  • Source document content or page snippets (proprietary corpus)
  • Document titles and internal database identifiers
  • Raw document files or training data

Note: Questions are designed to be answerable with publicly available safety documentation (SDS, OSHA regulations, equipment manuals) to enable independent evaluation.

8.3 Data Availability

The SALUS-SafetyQA benchmark is released under CC-BY-4.0 license:

  • Dataset & Code: github.com/Salus-Technologies/SALUS-SafetyQA
  • License: Creative Commons Attribution 4.0 International

8.4 Limitations

  • English-heavy corpus: Incomplete coverage for non-US regulations
  • Domain shift risk: New standards and equipment revisions may not be represented
  • Multiple-choice format: May overestimate performance relative to open-ended extraction
  • Temporal constraints: Safety regulations and standards change over time

9. Conclusion

We present SALUS IQ, a domain-specialized construction safety AI system, and introduce the SALUS-SafetyQA Benchmark containing 1,023 expert-validated questions. Results show that tailored RAG pipelines significantly outperform general-purpose LLMs in high-stakes safety domains, with 13-19 percentage point improvements and 100% document grounding.

The benchmark reveals systematic failure modes in frontier LLMs: hallucination of plausible but incorrect specifications, poor performance on equipment manuals, and overconfidence even when wrong. These findings demonstrate the critical need for domain-specific systems in safety-critical applications.

Future work includes expanding jurisdictional coverage, incorporating multimodal inputs (drawings, diagrams), and developing open-ended evaluation protocols.

References

  1. Lewis, P., et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS, 2020.
  2. Guu, K., et al. REALM: Retrieval-Augmented Language Model Pre-Training. ICML, 2020.
  3. Karpukhin, V., et al. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP, 2020.
  4. Khattab, O., & Zaharia, M. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR, 2020.
  5. Ma, X., et al. Query Rewriting for Retrieval-Augmented Large Language Models. arXiv:2305.14283, 2023.
  6. Ma, X., et al. Fine-Tuning LLaMA for Multi-Stage Text Retrieval. arXiv:2310.08319, 2023.
  7. Nogueira, R., & Cho, K. Passage Re-ranking with BERT. arXiv:1901.04085, 2019.
  8. Singhal, K., et al. Towards Expert-Level Medical Question Answering with Large Language Models. arXiv:2305.09617, 2023.
  9. Zheng, C., et al. A Reasoning-Focused Legal Retrieval Benchmark. arXiv:2505.03970, 2025.
  10. Taylor, R., et al. Galactica: A Large Language Model for Science. arXiv:2211.09085, 2022.
  11. Zhang, F., et al. A Hybrid Structured Deep Neural Network with Word2Vec for Construction Accident Causes Classification. Complex & Intelligent Systems, 2019.
  12. Tixier, A., et al. Application of Machine Learning to Construction Injury Prediction. Automation in Construction, 2016.
  13. Fang, Q., et al. Detecting Non-Hardhat Use by a Deep Learning Method from Far-Field Surveillance Videos. Automation in Construction, 2018.


Citation

@techreport{salus2025safetyqa,
  title={SALUS IQ: Technical White Paper & Benchmark Report},
  author={Alex Jacobs and Dany Ayvazov},
  institution={SALUS Safety},
  year={2025},
  month={October},
  note={Version 1.0}
}