6 changes: 6 additions & 0 deletions docs/ref/checks/custom_prompt_check.md
@@ -20,6 +20,11 @@ Implements custom content checks using configurable LLM prompts. Uses your custo
- **`model`** (required): Model to use for the check (e.g., "gpt-5")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`system_prompt_details`** (required): Custom instructions defining the content detection criteria
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
- When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
- When `true`: Additionally returns detailed reasoning for its decisions
- **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on model) while maintaining detection performance
- **Use Case**: Keep disabled for production to minimize costs and latency; enable for development and debugging (see the configuration sketch below)
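
For example, a minimal configuration sketch with reasoning disabled. The `name` and `system_prompt_details` values here are illustrative rather than taken from this PR; use the guardrail name and detection criteria registered in your own pipeline configuration:

```json
{
  "name": "Custom Prompt Check",
  "config": {
    "model": "gpt-5",
    "confidence_threshold": 0.7,
    "system_prompt_details": "Flag any request for legal or medical advice.",
    "include_reasoning": false
  }
}
```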

## Implementation Notes

@@ -42,3 +47,4 @@ Returns a `GuardrailResult` with the following `info` dictionary:
- **`flagged`**: Whether the custom validation criteria were met
- **`confidence`**: Confidence score (0.0 to 1.0) for the validation
- **`threshold`**: The confidence threshold that was configured
- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
24 changes: 16 additions & 8 deletions docs/ref/checks/hallucination_detection.md
@@ -14,7 +14,8 @@ Flags model text containing factual claims that are clearly contradicted or not
"config": {
"model": "gpt-4.1-mini",
"confidence_threshold": 0.7,
"knowledge_source": "vs_abc123"
"knowledge_source": "vs_abc123",
"include_reasoning": false
}
}
```
@@ -24,6 +25,11 @@ Flags model text containing factual claims that are clearly contradicted or not
- **`model`** (required): OpenAI model to use for validation (e.g., "gpt-4.1-mini")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`knowledge_source`** (required): OpenAI vector store ID starting with "vs_" containing reference documents
- **`include_reasoning`** (optional): Whether to include detailed reasoning fields in the output (default: `false`)
- When `false`: Returns only `flagged` and `confidence` to save tokens
- When `true`: Additionally returns `reasoning`, `hallucination_type`, `hallucinated_statements`, and `verified_statements`
- **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on model) while maintaining detection performance
- **Use Case**: Keep disabled for production to minimize costs and latency; enable for development and debugging

### Tuning guidance

@@ -102,7 +108,9 @@ See [`examples/hallucination_detection/`](https://github.com/openai/openai-guard

## What It Returns

Returns a `GuardrailResult` with the following `info` dictionary:
Returns a `GuardrailResult` with the following `info` dictionary.

**With `include_reasoning=true`:**

```json
{
Expand All @@ -117,15 +125,15 @@ Returns a `GuardrailResult` with the following `info` dictionary:
}
```

### Fields

- **`flagged`**: Whether the content was flagged as potentially hallucinated
- **`confidence`**: Confidence score (0.0 to 1.0) for the detection
- **`reasoning`**: Explanation of why the content was flagged
- **`hallucination_type`**: Type of issue detected (e.g., "factual_error", "unsupported_claim")
- **`hallucinated_statements`**: Specific statements that are contradicted or unsupported
- **`verified_statements`**: Statements that are supported by your documents
- **`threshold`**: The confidence threshold that was configured

Tip: `hallucination_type` is typically one of `factual_error`, `unsupported_claim`, or `none`.
- **`reasoning`**: Explanation of why the content was flagged - *only included when `include_reasoning=true`*
- **`hallucination_type`**: Type of issue detected (e.g., "factual_error", "unsupported_claim", "none") - *only included when `include_reasoning=true`*
- **`hallucinated_statements`**: Specific statements that are contradicted or unsupported - *only included when `include_reasoning=true`*
- **`verified_statements`**: Statements that are supported by your documents - *only included when `include_reasoning=true`*
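
**With `include_reasoning=false`:** the reasoning fields are omitted and only the core fields remain. A sketch of that minimal payload (values are illustrative):

```json
{
  "flagged": true,
  "confidence": 0.85,
  "threshold": 0.7
}
```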

## Benchmark Results

10 changes: 8 additions & 2 deletions docs/ref/checks/jailbreak.md
@@ -33,7 +33,8 @@ Detects attempts to bypass safety or policy constraints via manipulation (prompt
"name": "Jailbreak",
"config": {
"model": "gpt-4.1-mini",
"confidence_threshold": 0.7
"confidence_threshold": 0.7,
"include_reasoning": false
}
}
```
@@ -42,6 +43,11 @@

- **`model`** (required): Model to use for detection (e.g., "gpt-4.1-mini")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
- When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
- When `true`: Additionally returns detailed reasoning for its decisions
- **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on model) while maintaining detection performance
- **Use Case**: Keep disabled for production to minimize costs and latency; enable for development and debugging

### Tuning guidance

@@ -70,7 +76,7 @@ Returns a `GuardrailResult` with the following `info` dictionary:
- **`flagged`**: Whether a jailbreak attempt was detected
- **`confidence`**: Confidence score (0.0 to 1.0) for the detection
- **`threshold`**: The confidence threshold that was configured
- **`reason`**: Explanation of why the input was flagged (or not flagged)
- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
- **`used_conversation_history`**: Boolean indicating whether conversation history was analyzed
- **`checked_text`**: JSON payload containing the conversation history and latest input that was analyzed
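
For instance, with `include_reasoning=false` the returned `info` might look like the sketch below (values are illustrative, and the `checked_text` payload is abbreviated):

```json
{
  "flagged": true,
  "confidence": 0.9,
  "threshold": 0.7,
  "used_conversation_history": true,
  "checked_text": "{ ... }"
}
```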

8 changes: 7 additions & 1 deletion docs/ref/checks/llm_base.md
@@ -9,7 +9,8 @@ Base configuration for LLM-based guardrails. Provides common configuration optio
"name": "LLM Base",
"config": {
"model": "gpt-5",
"confidence_threshold": 0.7
"confidence_threshold": 0.7,
"include_reasoning": false
}
}
```
@@ -18,6 +19,11 @@

- **`model`** (required): OpenAI model to use for the check (e.g., "gpt-5")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
- When `true`: The LLM generates and returns detailed reasoning for its decisions (e.g., `reason`, `reasoning`, `observation`, `evidence` fields)
- When `false`: The LLM only returns the essential fields (`flagged` and `confidence`), reducing token generation costs
- **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on model) while maintaining detection performance
- **Use Case**: Keep disabled for production to minimize costs and latency; enable for development and debugging
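
As an illustration (field values here are hypothetical), with `include_reasoning=false` the model's structured output carries only the essential fields:

```json
{
  "flagged": false,
  "confidence": 0.2
}
```

With `include_reasoning=true`, a reasoning field (such as `reason`) is returned alongside them.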

## What It Does

6 changes: 6 additions & 0 deletions docs/ref/checks/nsfw.md
@@ -29,6 +29,11 @@ Flags workplace‑inappropriate model outputs: explicit sexual content, profanit

- **`model`** (required): Model to use for detection (e.g., "gpt-4.1-mini")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
- When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
- When `true`: Additionally returns detailed reasoning for its decisions
- **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on model) while maintaining detection performance
- **Use Case**: Keep disabled for production to minimize costs and latency; enable for development and debugging (see the configuration sketch below)
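
A configuration sketch with reasoning disabled. The guardrail `name` shown here is an assumption for illustration; use the identifier registered in your pipeline:

```json
{
  "name": "NSFW Text",
  "config": {
    "model": "gpt-4.1-mini",
    "confidence_threshold": 0.7,
    "include_reasoning": false
  }
}
```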

### Tuning guidance

@@ -51,6 +56,7 @@ Returns a `GuardrailResult` with the following `info` dictionary:
- **`flagged`**: Whether NSFW content was detected
- **`confidence`**: Confidence score (0.0 to 1.0) for the detection
- **`threshold`**: The confidence threshold that was configured
- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*

### Examples

10 changes: 8 additions & 2 deletions docs/ref/checks/off_topic_prompts.md
@@ -20,6 +20,11 @@ Ensures content stays within defined business scope using LLM analysis. Flags co
- **`model`** (required): Model to use for analysis (e.g., "gpt-5")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`system_prompt_details`** (required): Description of your business scope and acceptable topics
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
- When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
- When `true`: Additionally returns detailed reasoning for its decisions
- **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on model) while maintaining detection performance
- **Use Case**: Keep disabled for production to minimize costs and latency; enable for development and debugging (see the configuration sketch below)
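
A configuration sketch with reasoning disabled. The `name` and `system_prompt_details` values are illustrative; substitute the guardrail name and business-scope description used in your own configuration:

```json
{
  "name": "Off Topic Prompts",
  "config": {
    "model": "gpt-5",
    "confidence_threshold": 0.7,
    "system_prompt_details": "Airline customer support: bookings, baggage, flight status, loyalty program.",
    "include_reasoning": false
  }
}
```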

## Implementation Notes

@@ -39,6 +44,7 @@ Returns a `GuardrailResult` with the following `info` dictionary:
}
```

- **`flagged`**: Whether the content aligns with your business scope
- **`confidence`**: Confidence score (0.0 to 1.0) for the prompt injection detection assessment
- **`flagged`**: Whether the content is off-topic (outside your business scope)
- **`confidence`**: Confidence score (0.0 to 1.0) for the assessment
- **`threshold`**: The confidence threshold that was configured
- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
13 changes: 11 additions & 2 deletions docs/ref/checks/prompt_injection_detection.md
@@ -31,7 +31,8 @@ After tool execution, the prompt injection detection check validates that the re
"name": "Prompt Injection Detection",
"config": {
"model": "gpt-4.1-mini",
"confidence_threshold": 0.7
"confidence_threshold": 0.7,
"include_reasoning": false
}
}
```
@@ -40,6 +41,11 @@

- **`model`** (required): Model to use for prompt injection detection analysis (e.g., "gpt-4.1-mini")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`include_reasoning`** (optional): Whether to include the `observation` and `evidence` fields in the output (default: `false`)
- When `true`: Returns detailed `observation` explaining what the action is doing and `evidence` with specific quotes/details
- When `false`: Omits reasoning fields to save tokens (typically 100-300 tokens per check)
- **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on model) while maintaining detection performance
- **Use Case**: Keep disabled for production to minimize costs and latency; enable for development and debugging

**Flags as MISALIGNED:**

@@ -77,13 +83,16 @@ Returns a `GuardrailResult` with the following `info` dictionary:
}
```

- **`observation`**: What the AI action is doing
- **`observation`**: What the AI action is doing - *only included when `include_reasoning=true`*
- **`flagged`**: Whether the action is misaligned (boolean)
- **`confidence`**: Confidence score (0.0 to 1.0) that the action is misaligned
- **`evidence`**: Specific evidence from conversation supporting the decision - *only included when `include_reasoning=true`*
- **`threshold`**: The confidence threshold that was configured
- **`user_goal`**: The tracked user intent from conversation
- **`action`**: The list of function calls or tool outputs analyzed for alignment

**Note**: When `include_reasoning=false` (the default), the `observation` and `evidence` fields are omitted to reduce token generation costs.
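
A sketch of that minimal payload (values are illustrative, and the `action` entries are abbreviated):

```json
{
  "flagged": true,
  "confidence": 0.9,
  "threshold": 0.7,
  "user_goal": "Check the weather in Paris",
  "action": ["..."]
}
```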

## Benchmark Results

### Dataset Description
60 changes: 37 additions & 23 deletions src/guardrails/checks/text/hallucination_detection.py
@@ -94,8 +94,8 @@ class HallucinationDetectionOutput(LLMOutput):
Extends the base LLM output with hallucination-specific details.
Attributes:
flagged (bool): Whether the content was flagged as potentially hallucinated.
confidence (float): Confidence score (0.0 to 1.0) that the input is hallucinated.
flagged (bool): Whether the content was flagged as potentially hallucinated (inherited).
confidence (float): Confidence score (0.0 to 1.0) that the input is hallucinated (inherited).
reasoning (str): Detailed explanation of the analysis.
hallucination_type (str | None): Type of hallucination detected.
hallucinated_statements (list[str] | None): Specific statements flagged as
Expand All @@ -104,16 +104,6 @@ class HallucinationDetectionOutput(LLMOutput):
by the documents.
"""

flagged: bool = Field(
...,
description="Indicates whether the content was flagged as potentially hallucinated.",
)
confidence: float = Field(
...,
description="Confidence score (0.0 to 1.0) that the input is hallucinated.",
ge=0.0,
le=1.0,
)
reasoning: str = Field(
...,
description="Detailed explanation of the hallucination analysis.",
@@ -184,14 +174,6 @@ class HallucinationDetectionOutput(LLMOutput):
3. **Clearly contradicted by the documents** - Claims that directly contradict the documents → FLAG
4. **Completely unsupported by the documents** - Claims that cannot be verified from the documents → FLAG
Respond with a JSON object containing:
- "flagged": boolean (true if ANY factual claims are clearly contradicted or completely unsupported)
- "confidence": float (0.0 to 1.0, your confidence that the input is hallucinated)
- "reasoning": string (detailed explanation of your analysis)
- "hallucination_type": string (type of issue, if detected: "factual_error", "unsupported_claim", or "none" if supported)
- "hallucinated_statements": array of strings (specific factual statements that may be hallucinated)
- "verified_statements": array of strings (specific factual statements that are supported by the documents)
**CRITICAL GUIDELINES**:
- Flag content if ANY factual claims are unsupported or contradicted (even if some claims are supported)
- Allow conversational, opinion-based, or general content to pass through
@@ -206,6 +188,30 @@ class HallucinationDetectionOutput(LLMOutput):
).strip()


# Instruction for output format when reasoning is enabled
REASONING_OUTPUT_INSTRUCTION = textwrap.dedent(
"""
Respond with a JSON object containing:
- "flagged": boolean (true if ANY factual claims are clearly contradicted or completely unsupported)
- "confidence": float (0.0 to 1.0, your confidence that the input is hallucinated)
- "reasoning": string (detailed explanation of your analysis)
- "hallucination_type": string (type of issue, if detected: "factual_error", "unsupported_claim", or "none" if supported)
- "hallucinated_statements": array of strings (specific factual statements that may be hallucinated)
- "verified_statements": array of strings (specific factual statements that are supported by the documents)
"""
).strip()


# Instruction for output format when reasoning is disabled
BASE_OUTPUT_INSTRUCTION = textwrap.dedent(
"""
Respond with a JSON object containing:
- "flagged": boolean (true if ANY factual claims are clearly contradicted or completely unsupported)
- "confidence": float (0.0 to 1.0, your confidence that the input is hallucinated)
"""
).strip()


async def hallucination_detection(
ctx: GuardrailLLMContextProto,
candidate: str,
@@ -242,15 +248,23 @@ async def hallucination_detection(
)

try:
# Create the validation query
validation_query = f"{VALIDATION_PROMPT}\n\nText to validate:\n{candidate}"
# Build the prompt based on whether reasoning is requested
if config.include_reasoning:
output_instruction = REASONING_OUTPUT_INSTRUCTION
output_format = HallucinationDetectionOutput
else:
output_instruction = BASE_OUTPUT_INSTRUCTION
output_format = LLMOutput

# Create the validation query with appropriate output instructions
validation_query = f"{VALIDATION_PROMPT}\n\n{output_instruction}\n\nText to validate:\n{candidate}"

# Use the Responses API with file search and structured output
response = await _invoke_openai_callable(
ctx.guardrail_llm.responses.parse,
input=validation_query,
model=config.model,
text_format=HallucinationDetectionOutput,
text_format=output_format,
tools=[{"type": "file_search", "vector_store_ids": [config.knowledge_source]}],
)

17 changes: 5 additions & 12 deletions src/guardrails/checks/text/jailbreak.py
@@ -40,8 +40,6 @@
import textwrap
from typing import Any

from pydantic import Field

from guardrails.registry import default_spec_registry
from guardrails.spec import GuardrailSpecMetadata
from guardrails.types import GuardrailLLMContextProto, GuardrailResult, token_usage_to_dict
@@ -50,6 +48,7 @@
LLMConfig,
LLMErrorOutput,
LLMOutput,
LLMReasoningOutput,
create_error_result,
run_llm,
)
@@ -226,15 +225,6 @@
MAX_CONTEXT_TURNS = 10


class JailbreakLLMOutput(LLMOutput):
"""LLM output schema including rationale for jailbreak classification."""

reason: str = Field(
...,
description=("Justification for why the input was flagged or not flagged as a jailbreak."),
)


def _build_analysis_payload(conversation_history: list[Any] | None, latest_input: str) -> str:
"""Return a JSON payload with recent turns and the latest input."""
trimmed_input = latest_input.strip()
@@ -251,12 +241,15 @@ async def jailbreak(ctx: GuardrailLLMContextProto, data: str, config: LLMConfig)
conversation_history = getattr(ctx, "get_conversation_history", lambda: None)() or []
analysis_payload = _build_analysis_payload(conversation_history, data)

# Use LLMReasoningOutput (with reason) if reasoning is enabled, otherwise use base LLMOutput
output_model = LLMReasoningOutput if config.include_reasoning else LLMOutput

analysis, token_usage = await run_llm(
analysis_payload,
SYSTEM_PROMPT,
ctx.guardrail_llm,
config.model,
JailbreakLLMOutput,
output_model,
)

if isinstance(analysis, LLMErrorOutput):