SALUS IQ: Technical White Paper & Benchmark Report
Alex Jacobs; Dany Ayvazov • SALUS Safety • October 2025
Performance on SALUS-SafetyQA
A comprehensive benchmark of 1,023 safety questions
Sample Benchmark Question
When using Corotech Polyamide Epoxy Primer Red CV150, what is the ACGIH recommended Time-Weighted Average (TWA) exposure limit for ethyl benzene?
Abstract
We present SALUS IQ, a domain-specific construction safety AI system that achieves 94.0% accuracy on construction safety queries, 13-19 percentage points higher than GPT-5, Claude, and Gemini.
The Problem
Construction safety documentation is critical but underserved by general-purpose AI. ChatGPT, Claude, and Gemini frequently hallucinate plausible but incorrect safety information—a high-stakes failure mode that can lead to injuries or fatalities.
SALUS IQ solves this through a domain-specific AI system that combines hybrid retrieval (dense vectors + BM25 + reranking) with safety-optimized prompting to deliver 100% document-grounded answers.
Our Contributions
- SALUS-SafetyQA, the first comprehensive safety AI benchmark (1,023 questions)
- 13-19 percentage point improvement over frontier LLMs (p < 0.001)
- Identification of systematic failure modes in general-purpose LLMs
1. Introduction
Why Construction Safety AI is Hard
- SDS, manuals, and regulations follow distinct structures requiring specialized parsing
- Incorrect safety information can result in injuries, fatalities, and legal liability
- Regulations update frequently, requiring version-aware retrieval
- State and federal regulations may conflict with or complement each other
Why Existing Solutions Fail
- General-purpose LLMs (ChatGPT, Gemini, Claude) hallucinate plausible but incorrect safety information
- Document search tools return relevant documents but require manual extraction of specific answers
- Rule-based systems lack the flexibility for natural language queries and require constant maintenance
1.2 Contributions
This paper makes the following contributions:
- SALUS-SafetyQA benchmark - The first comprehensive evaluation benchmark for construction safety AI, containing 1,023 expert-validated multiple-choice questions across 11 question types and 10 document source types
- Comprehensive comparative evaluation - Systematic evaluation of domain-specific RAG (SALUS IQ) versus frontier LLMs (GPT-5, Claude 4.1/4.5, Gemini 2.5 Pro) with rigorous statistical analysis demonstrating 13-19 percentage point improvements
- Analysis of domain adaptation gaps - Identification of systematic failure modes in general-purpose LLMs, including hallucinated specifications, poor performance on equipment manuals, and overconfident, poorly calibrated predictions
3. The SALUS IQ System
SALUS IQ is a production AI system built specifically for construction safety. Unlike general-purpose LLMs, it combines hybrid retrieval, safety-optimized prompting, and 100% document grounding to deliver verifiable answers to safety-critical questions.
Architecture: 6-Layer Retrieval Pipeline
1. Document parsing: type-aware parsing for SDS, regulations, standards, and manuals
2. Hybrid search: dense + sparse vector search with metadata filtering
3. Multi-signal ranking: combines semantic similarity, keyword match, and learned rerankers
4. Query decomposition: model-based decomposition of queries into canonical search terms
5. Span verification: checks retrieved spans for alignment with the query
6. Grounded generation: safety-optimized prompt enforcing grounding and compliance
Key Innovations
- Hybrid retrieval: combines dense vectors (semantic), BM25 (keyword), and learned reranking for robust retrieval across heterogeneous safety documents (a minimal sketch follows this list)
- Type-aware parsing: specialized extraction for SDS, regulations, equipment manuals, and standards, each requiring a different strategy
- Safety-optimized prompting: custom prompts that enforce source citation, prioritize accuracy over completeness, and flag uncertainty
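As a rough illustration of the retrieval fusion, the sketch below combines min-max-normalized dense and BM25 scores. The `embed` stub, the fusion weight `alpha`, and the absence of the learned reranker are all simplifications; this shows the general pattern, not the SALUS IQ implementation.

```python
# Minimal sketch of hybrid (dense + BM25) score fusion.
# embed() is a hypothetical stand-in for a sentence-embedding model.
import hashlib
import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def embed(text: str) -> np.ndarray:
    """Deterministic placeholder encoder; a real system uses a trained model."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).normal(size=384)
    return v / np.linalg.norm(v)

def hybrid_search(query: str, docs: list[str], alpha: float = 0.6, top_k: int = 5):
    # Dense signal: cosine similarity between unit-norm embeddings.
    q = embed(query)
    dense = np.array([q @ embed(d) for d in docs])
    # Sparse signal: BM25 over whitespace-tokenized text.
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    sparse = np.array(bm25.get_scores(query.lower().split()))
    # Min-max normalize each signal so the weighted sum is comparable.
    norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-9)
    fused = alpha * norm(dense) + (1 - alpha) * norm(sparse)
    top = np.argsort(-fused)[:top_k]
    return [(docs[i], float(fused[i])) for i in top]
```

In a production pipeline a reranker would rescore these top candidates before generation.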
4. Dataset & Benchmark Design
Benchmark Question Distribution by Source
(Chart omitted: the five largest source types.) The remaining five sources are Safety Alerts (27), Reports (24), Forms (10), Policies (10), and Other (10).
Question Generation Pipeline
1. Question drafting: worker-phrased safety queries
2. Answer grounding: document-grounded correct response
3. Distractor generation: plausible but incorrect options
4. Type classification: specification, compliance, hazards, etc.
5. Jurisdiction tagging: state/province tagging where relevant
6. Expert validation: each question validated for suitability and accuracy
Question Type Distribution
(Chart omitted: the top question categories.) Five additional types: Emergency (31), What PPE (28), Who Responsible (23), Inspection (15), Incident (6).
Example Question
```json
{
"mc_question": "In Michigan, what is the required service interval and periodic test voltage for a fiberglass live-line tool used for primary employee protection?",
"mc_options": [
{
"label": "a",
"text": "Every year, tested at 50,000 volts per foot for 1 minute.",
"is_correct": false
},
{
"label": "b",
"text": "Every 2 years, tested at 100,000 volts per foot for 5 minutes.",
"is_correct": false
},
{
"label": "c",
"text": "Every year, tested at 100,000 volts per foot for 5 minutes.",
"is_correct": false
},
{
"label": "d",
"text": "Every 2 years, tested at 75,000 volts per foot for 1 minute.",
"is_correct": true
}
],
"mc_correct_answer": "d"
}
```
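A minimal structural check for records of this form (field names taken from the example above; the checks are illustrative, not the validation used in the actual pipeline):

```python
# Structural sanity check for a SALUS-SafetyQA question record.
def validate_question(q: dict) -> None:
    assert q["mc_question"].strip(), "question text must be non-empty"
    labels = [opt["label"] for opt in q["mc_options"]]
    assert len(labels) == len(set(labels)), "option labels must be unique"
    correct = [opt["label"] for opt in q["mc_options"] if opt["is_correct"]]
    assert len(correct) == 1, "exactly one option must be marked correct"
    assert correct[0] == q["mc_correct_answer"], "answer key must match the flag"
```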
5. Evaluation Methodology
5.1 Benchmark Harness
Evaluation is performed with a dedicated script that:
- Loads questions from JSON
- Calls the SALUS IQ benchmark endpoint or model provider endpoint
- Records selected answer, correctness, reasoning, retrieval stats
- Computes metrics: accuracy, average confidence, retrieval usage
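A minimal sketch of the harness loop is given below; `ask_model` is a hypothetical stand-in for the call to the SALUS IQ endpoint or a provider API, and field names follow the example question record above.

```python
# Sketch of the benchmark harness loop. ask_model() is a hypothetical
# stand-in for a call to the system under test.
import json

def ask_model(question: str, options: list[str]) -> str:
    """Return a label ('a'-'d'); wire this to the endpoint under test."""
    raise NotImplementedError

def evaluate(path: str) -> float:
    with open(path) as f:
        questions = json.load(f)
    correct = 0
    for q in questions:
        options = [f"{opt['label']}) {opt['text']}" for opt in q["mc_options"]]
        pred = ask_model(q["mc_question"], options)
        correct += int(pred == q["mc_correct_answer"])
    accuracy = correct / len(questions)
    print(f"Accuracy: {accuracy:.1%} ({correct}/{len(questions)})")
    return accuracy
```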
5.2 Comparative Baselines
We evaluate:
- SALUS IQ (domain expert LLM system)
- GPT-5, via API (OpenAI)
- Claude 4.1 Opus (Anthropic)
- Claude 4.5 Sonnet (Anthropic)
- Gemini 2.5 Pro (Google DeepMind)
All baselines receive identical multiple-choice prompts with zero-shot evaluation.
5.3 Metrics
- Accuracy: Percentage of correct answers
- Confidence calibration: Mean confidence split by correctness
- Retrieval statistics: Average docs retrieved, percentage using context
- Statistical significance: Bootstrap confidence intervals and McNemar's test (α=0.05)
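The two procedures can be sketched as follows, assuming `a` and `b` are boolean per-question correctness arrays for two systems on the same questions (illustrative code, not the released analysis scripts):

```python
# Sketch of the statistical procedures: percentile-bootstrap CI for accuracy
# and McNemar's test on paired correctness.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def bootstrap_ci(correct: np.ndarray, n_boot: int = 10_000, alpha: float = 0.05):
    """Percentile bootstrap confidence interval for mean accuracy."""
    rng = np.random.default_rng(0)
    n = len(correct)
    boots = np.array([correct[rng.integers(0, n, n)].mean() for _ in range(n_boot)])
    return np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])

def mcnemar_p(a: np.ndarray, b: np.ndarray) -> float:
    """McNemar's test on the 2x2 paired-outcome table; the off-diagonal
    (disagreement) cells drive the statistic."""
    table = [[int(np.sum(a & b)), int(np.sum(a & ~b))],
             [int(np.sum(~a & b)), int(np.sum(~a & ~b))]]
    return mcnemar(table, exact=False, correction=True).pvalue
```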
6. Results
6.1 Overall Performance
System | Accuracy | 95% CI | Notes |
---|---|---|---|
SALUS IQ (2025.10) | 94.0% | [92.57%, 95.41%] | Hybrid RAG |
GPT-5 (API) | 80.3% | [77.81%, 82.70%] | Zero-shot |
Claude 4.1 Opus | 80.9% | [78.49%, 83.28%] | Zero-shot |
Claude 4.5 Sonnet | 75.1% | [72.43%, 77.71%] | Zero-shot |
Gemini 2.5 Pro | 79.0% | [76.44%, 81.43%] | Zero-shot |
Table 1: Overall benchmark results on SALUS-SafetyQA (n=1,023 questions). SALUS IQ shows 13.78 pp improvement over GPT-5, 13.10 pp over Claude 4.1 Opus, 18.96 pp over Claude 4.5 Sonnet, and 15.05 pp over Gemini 2.5 Pro. All comparisons are statistically significant (p < 0.001, McNemar's test).
6.2 Performance Visualizations
(Figures omitted: performance visualizations; the underlying numbers appear in Tables 1 and 2.)
6.3 Statistical Significance
All pairwise comparisons between SALUS IQ and baseline models show highly significant differences (McNemar's test, p < 0.001):
Comparison | SALUS Wins | Baseline Wins | Net Advantage | p-value |
---|---|---|---|---|
vs GPT-5 | 171 | 30 | +141 | p < 0.001 |
vs Claude 4.1 Opus | 163 | 29 | +134 | p < 0.001 |
vs Claude 4.5 Sonnet | 225 | 31 | +194 | p < 0.001 |
vs Gemini 2.5 Pro | 188 | 34 | +154 | p < 0.001 |
Table 2: McNemar's test results for pairwise comparisons. On questions where the paired systems disagreed, SALUS IQ answered correctly in 84.7%-87.9% of cases (mean: 85.6%), demonstrating consistent superiority across all comparisons.
7. Analysis
Performance Improvements
Key insight: SALUS IQ achieves 90%+ accuracy on all question types, with consistent performance across categories (including 92.0% on specifications).
Biggest Performance Gaps
- Specification questions (31.7% of the benchmark): baseline LLMs hallucinate plausible but incorrect numerical values, the most frequent failure mode.
- Equipment manual questions: procedural knowledge shows a 32 pp performance gap, the baselines' largest weakness.
Error Analysis
- SALUS IQ errors: only 5.96% (61 of 1,023 questions)
- Baseline LLM errors: 21.7% on average
Confidence Calibration
- SALUS IQ: well calibrated. Confidence separates correct from incorrect answers, enabling human-in-the-loop review: predictions with confidence below 0.85 are flagged for expert review.
- Baseline models: poorly calibrated. Narrow confidence gaps between correct and incorrect answers mean the models are overconfident when wrong, making them unsuitable for safety-critical applications.
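A sketch of the calibration-gap measurement and the review gate described above (the 0.85 threshold comes from this report; the code itself is illustrative):

```python
# Sketch of the calibration gap and the human-in-the-loop review gate.
import numpy as np

def calibration_gap(confidences, correct):
    """Mean confidence on correct answers minus mean on incorrect ones.
    A large gap means confidence is informative; a near-zero gap means the
    model is about as confident when wrong as when right."""
    conf = np.asarray(confidences, dtype=float)
    ok = np.asarray(correct, dtype=bool)
    return conf[ok].mean() - conf[~ok].mean()

def flag_for_review(confidences, threshold=0.85):
    """Indices of predictions to route to an expert, per the <0.85 rule."""
    return [i for i, c in enumerate(confidences) if c < threshold]
```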
8. Transparency & Data Availability
8.1 What We Publish
- Full benchmark dataset (1,023 questions) with question text, options, correct answers, and metadata
- Complete evaluation harness (evaluate_benchmark.py) and configuration files
- Statistical analysis code with bootstrap confidence intervals and McNemar's tests
8.2 What We Do Not Publish
- Source document content or page snippets (proprietary corpus)
- Document titles and internal database identifiers
- Raw document files or training data
Note: Questions are designed to be answerable with publicly available safety documentation (SDS, OSHA regulations, equipment manuals) to enable independent evaluation.
8.3 Data Availability
The SALUS-SafetyQA benchmark is released under CC-BY-4.0 license:
- Dataset & Code: github.com/Salus-Technologies/SALUS-SafetyQA
- License: Creative Commons Attribution 4.0 International
8.4 Limitations
- English-heavy corpus: Incomplete coverage for non-US regulations
- Domain shift risk: New standards and equipment revisions may not be represented
- Multiple-choice format: May overestimate performance relative to open-ended extraction
- Temporal constraints: Safety regulations and standards change over time
9. Conclusion
We present SALUS IQ, a domain-specialized construction safety AI system, and introduce the SALUS-SafetyQA Benchmark containing 1,023 expert-validated questions. Results show that tailored RAG pipelines significantly outperform general-purpose LLMs in high-stakes safety domains, with 13-19 percentage point improvements and 100% document grounding.
The benchmark reveals systematic failure modes in frontier LLMs: hallucination of plausible but incorrect specifications, poor performance on equipment manuals, and overconfident, poorly calibrated predictions. These findings demonstrate the critical need for domain-specific systems in safety-critical applications.
Future work includes expanding jurisdictional coverage, incorporating multimodal inputs (drawings, diagrams), and developing open-ended evaluation protocols.
Citation
```bibtex
@techreport{salus2025safetyqa,
  title       = {SALUS IQ: Technical White Paper \& Benchmark Report},
  author      = {Alex Jacobs and Dany Ayvazov},
  institution = {SALUS Safety},
  year        = {2025},
  month       = {October},
  note        = {Version 1.0}
}
```