Skip to content
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 7 additions & 1 deletion docs/ref/checks/custom_prompt_check.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,8 @@ Implements custom content checks using configurable LLM prompts. Uses your custo
"config": {
"model": "gpt-5",
"confidence_threshold": 0.7,
"system_prompt_details": "Determine if the user's request needs to be escalated to a senior support agent. Indications of escalation include: ..."
"system_prompt_details": "Determine if the user's request needs to be escalated to a senior support agent. Indications of escalation include: ...",
"include_reasoning": false
}
}
```
Expand All @@ -20,6 +21,10 @@ Implements custom content checks using configurable LLM prompts. Uses your custo
- **`model`** (required): Model to use for the check (e.g., "gpt-5")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`system_prompt_details`** (required): Custom instructions defining the content detection criteria
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we include something about how this influences classifier performance?

- When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
- When `true`: Additionally, returns detailed reasoning for its decisions
- **Use Case**: Keep disabled for production to minimize costs; enable for development and debugging

## Implementation Notes

Expand All @@ -42,3 +47,4 @@ Returns a `GuardrailResult` with the following `info` dictionary:
- **`flagged`**: Whether the custom validation criteria were met
- **`confidence`**: Confidence score (0.0 to 1.0) for the validation
- **`threshold`**: The confidence threshold that was configured
- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
23 changes: 15 additions & 8 deletions docs/ref/checks/hallucination_detection.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,8 @@ Flags model text containing factual claims that are clearly contradicted or not
"config": {
"model": "gpt-4.1-mini",
"confidence_threshold": 0.7,
"knowledge_source": "vs_abc123"
"knowledge_source": "vs_abc123",
"include_reasoning": false
}
}
```
Expand All @@ -24,6 +25,10 @@ Flags model text containing factual claims that are clearly contradicted or not
- **`model`** (required): OpenAI model (required) to use for validation (e.g., "gpt-4.1-mini")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`knowledge_source`** (required): OpenAI vector store ID starting with "vs_" containing reference documents
- **`include_reasoning`** (optional): Whether to include detailed reasoning fields in the output (default: `false`)
- When `false`: Returns only `flagged` and `confidence` to save tokens
- When `true`: Additionally, returns `reasoning`, `hallucination_type`, `hallucinated_statements`, and `verified_statements`
- Recommended: Keep disabled for production (default); enable for development/debugging

### Tuning guidance

Expand Down Expand Up @@ -103,7 +108,9 @@ See [`examples/`](https://github.com/openai/openai-guardrails-js/tree/main/examp

## What It Returns

Returns a `GuardrailResult` with the following `info` dictionary:
Returns a `GuardrailResult` with the following `info` dictionary.

**With `include_reasoning=true`:**

```json
{
Expand All @@ -118,15 +125,15 @@ Returns a `GuardrailResult` with the following `info` dictionary:
}
```

### Fields

- **`flagged`**: Whether the content was flagged as potentially hallucinated
- **`confidence`**: Confidence score (0.0 to 1.0) for the detection
- **`reasoning`**: Explanation of why the content was flagged
- **`hallucination_type`**: Type of issue detected (e.g., "factual_error", "unsupported_claim")
- **`hallucinated_statements`**: Specific statements that are contradicted or unsupported
- **`verified_statements`**: Statements that are supported by your documents
- **`threshold`**: The confidence threshold that was configured

Tip: `hallucination_type` is typically one of `factual_error`, `unsupported_claim`, or `none`.
- **`reasoning`**: Explanation of why the content was flagged - *only included when `include_reasoning=true`*
- **`hallucination_type`**: Type of issue detected (e.g., "factual_error", "unsupported_claim", "none") - *only included when `include_reasoning=true`*
- **`hallucinated_statements`**: Specific statements that are contradicted or unsupported - *only included when `include_reasoning=true`*
- **`verified_statements`**: Statements that are supported by your documents - *only included when `include_reasoning=true`*

## Benchmark Results

Expand Down
9 changes: 7 additions & 2 deletions docs/ref/checks/jailbreak.md
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,8 @@ Detects attempts to bypass safety or policy constraints via manipulation (prompt
"name": "Jailbreak",
"config": {
"model": "gpt-4.1-mini",
"confidence_threshold": 0.7
"confidence_threshold": 0.7,
"include_reasoning": false
}
}
```
Expand All @@ -42,6 +43,10 @@ Detects attempts to bypass safety or policy constraints via manipulation (prompt

- **`model`** (required): Model to use for detection (e.g., "gpt-4.1-mini")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
- When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
- When `true`: Additionally, returns detailed reasoning for its decisions
- **Use Case**: Keep disabled for production to minimize costs; enable for development and debugging

### Tuning guidance

Expand All @@ -68,7 +73,7 @@ Returns a `GuardrailResult` with the following `info` dictionary:
- **`flagged`**: Whether a jailbreak attempt was detected
- **`confidence`**: Confidence score (0.0 to 1.0) for the detection
- **`threshold`**: The confidence threshold that was configured
- **`reason`**: Natural language rationale describing why the request was (or was not) flagged
- **`reason`**: Natural language rationale describing why the request was (or was not) flagged - *only included when `include_reasoning=true`*
- **`used_conversation_history`**: Indicates whether prior conversation turns were included
- **`checked_text`**: JSON payload containing the conversation slice and latest input analyzed

Expand Down
7 changes: 6 additions & 1 deletion docs/ref/checks/llm_base.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,8 @@ Base configuration for LLM-based guardrails. Provides common configuration optio
"name": "NSFW Text", // or "Jailbreak", "Hallucination Detection", etc.
"config": {
"model": "gpt-5",
"confidence_threshold": 0.7
"confidence_threshold": 0.7,
"include_reasoning": false
}
}
```
Expand All @@ -20,6 +21,10 @@ Base configuration for LLM-based guardrails. Provides common configuration optio

- **`model`** (required): OpenAI model to use for the check (e.g., "gpt-5")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
- When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
- When `true`: Additionally, returns detailed reasoning for its decisions
- **Use Case**: Keep disabled for production to minimize costs; enable for development and debugging

## What It Does

Expand Down
8 changes: 7 additions & 1 deletion docs/ref/checks/nsfw.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,8 @@ Flags workplace‑inappropriate model outputs: explicit sexual content, profanit
"name": "NSFW Text",
"config": {
"model": "gpt-4.1-mini",
"confidence_threshold": 0.7
"confidence_threshold": 0.7,
"include_reasoning": false
}
}
```
Expand All @@ -29,6 +30,10 @@ Flags workplace‑inappropriate model outputs: explicit sexual content, profanit

- **`model`** (required): Model to use for detection (e.g., "gpt-4.1-mini")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
- When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
- When `true`: Additionally, returns detailed reasoning for its decisions
- **Use Case**: Keep disabled for production to minimize costs; enable for development and debugging

### Tuning guidance

Expand All @@ -51,6 +56,7 @@ Returns a `GuardrailResult` with the following `info` dictionary:
- **`flagged`**: Whether NSFW content was detected
- **`confidence`**: Confidence score (0.0 to 1.0) for the detection
- **`threshold`**: The confidence threshold that was configured
- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*

### Examples

Expand Down
13 changes: 9 additions & 4 deletions docs/ref/checks/off_topic_prompts.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,8 @@ Ensures content stays within defined business scope using LLM analysis. Flags co
"config": {
"model": "gpt-5",
"confidence_threshold": 0.7,
"system_prompt_details": "Customer support for our e-commerce platform. Topics include order status, returns, shipping, and product questions."
"system_prompt_details": "Customer support for our e-commerce platform. Topics include order status, returns, shipping, and product questions.",
"include_reasoning": false
}
}
```
Expand All @@ -20,6 +21,10 @@ Ensures content stays within defined business scope using LLM analysis. Flags co
- **`model`** (required): Model to use for analysis (e.g., "gpt-5")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`system_prompt_details`** (required): Description of your business scope and acceptable topics
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
- When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
- When `true`: Additionally, returns detailed reasoning for its decisions
- **Use Case**: Keep disabled for production to minimize costs; enable for development and debugging

## Implementation Notes

Expand All @@ -40,7 +45,7 @@ Returns a `GuardrailResult` with the following `info` dictionary:
}
```

- **`flagged`**: Whether the content aligns with your business scope
- **`confidence`**: Confidence score (0.0 to 1.0) for the prompt injection detection assessment
- **`flagged`**: Whether the content is off-topic (outside your business scope)
- **`confidence`**: Confidence score (0.0 to 1.0) for the assessment
- **`threshold`**: The confidence threshold that was configured
- **`business_scope`**: Copy of the scope provided in configuration
- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
11 changes: 8 additions & 3 deletions docs/ref/checks/prompt_injection_detection.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,8 @@ After tool execution, the prompt injection detection check validates that the re
"name": "Prompt Injection Detection",
"config": {
"model": "gpt-4.1-mini",
"confidence_threshold": 0.7
"confidence_threshold": 0.7,
"include_reasoning": false
}
}
```
Expand All @@ -40,6 +41,10 @@ After tool execution, the prompt injection detection check validates that the re

- **`model`** (required): Model to use for prompt injection detection analysis (e.g., "gpt-4.1-mini")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`include_reasoning`** (optional): Whether to include detailed reasoning fields (`observation` and `evidence`) in the output (default: `false`)
- When `false`: Returns only `flagged` and `confidence` to save tokens
- When `true`: Additionally, returns `observation` and `evidence` fields
- Recommended: Keep disabled for production (default); enable for development/debugging

**Flags as MISALIGNED:**

Expand Down Expand Up @@ -85,15 +90,15 @@ Returns a `GuardrailResult` with the following `info` dictionary:
}
```

- **`observation`**: What the AI action is doing
- **`flagged`**: Whether the action is misaligned (boolean)
- **`confidence`**: Confidence score (0.0 to 1.0) that the action is misaligned
- **`evidence`**: Specific evidence from conversation history that supports the decision (null when aligned)
- **`threshold`**: The confidence threshold that was configured
- **`user_goal`**: The tracked user intent from conversation
- **`action`**: The list of function calls or tool outputs analyzed for alignment
- **`recent_messages`**: Most recent conversation slice evaluated during the check
- **`recent_messages_json`**: JSON-serialized snapshot of the recent conversation slice
- **`observation`**: What the AI action is doing - *only included when `include_reasoning=true`*
- **`evidence`**: Specific evidence from conversation history that supports the decision (null when aligned) - *only included when `include_reasoning=true`*

## Benchmark Results

Expand Down
Loading