`docs/ref/checks/custom_prompt_check.md` (6 additions, 0 deletions)
Configuration options:

- **`model`** (required): Model to use for the check (e.g., "gpt-5")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`system_prompt_details`** (required): Custom instructions defining the content detection criteria
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`); see the config sketch after this list
    - When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
    - When `true`: Additionally returns detailed reasoning for its decisions
- **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on model) while maintaining detection performance
- **Use Case**: Keep disabled for production to minimize costs and latency; enable for development and debugging
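A minimal config sketch showing the new option alongside the required fields; the check name and the `system_prompt_details` text below are illustrative placeholders, not from the source:

```json
{
  "name": "Custom Prompt Check",
  "config": {
    "model": "gpt-5",
    "confidence_threshold": 0.7,
    "system_prompt_details": "Flag any message that asks the assistant to provide medical diagnoses.",
    "include_reasoning": false
  }
}
```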
Returns a `GuardrailResult` with the following `info` dictionary (a payload sketch follows the list):

- **`flagged`**: Whether the custom validation criteria were met
- **`confidence`**: Confidence score (0.0 to 1.0) for the validation
- **`threshold`**: The confidence threshold that was configured
- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
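For illustration, a sketch of that payload with reasoning enabled; the values are made up and only the documented fields are shown:

```json
{
  "flagged": true,
  "confidence": 0.91,
  "threshold": 0.7,
  "reason": "The message asks for a medical diagnosis, which the custom criteria flag."
}
```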
`docs/ref/checks/hallucination_detection.md` (16 additions, 8 deletions)
Config example:

```json
"config": {
  "model": "gpt-4.1-mini",
  "confidence_threshold": 0.7,
  "knowledge_source": "vs_abc123",
  "include_reasoning": false
}
```
Configuration options:

- **`model`** (required): OpenAI model to use for validation (e.g., "gpt-4.1-mini")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`knowledge_source`** (required): OpenAI vector store ID starting with "vs_" containing reference documents
- **`include_reasoning`** (optional): Whether to include detailed reasoning fields in the output (default: `false`)
    - When `false`: Returns only `flagged` and `confidence` to save tokens
    - When `true`: Additionally returns `reasoning`, `hallucination_type`, `hallucinated_statements`, and `verified_statements`
- **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on model) while maintaining detection performance
- **Use Case**: Keep disabled for production to minimize costs and latency; enable for development and debugging
## What It Returns

Returns a `GuardrailResult` with the following `info` dictionary.

**With `include_reasoning=true`:**

```json
{
  ...
}
```

### Fields

- **`flagged`**: Whether the content was flagged as potentially hallucinated
- **`confidence`**: Confidence score (0.0 to 1.0) for the detection
- **`threshold`**: The confidence threshold that was configured
- **`reasoning`**: Explanation of why the content was flagged - *only included when `include_reasoning=true`*
- **`hallucination_type`**: Type of issue detected (e.g., "factual_error", "unsupported_claim", "none") - *only included when `include_reasoning=true`*
- **`hallucinated_statements`**: Specific statements that are contradicted or unsupported - *only included when `include_reasoning=true`*
- **`verified_statements`**: Statements that are supported by your documents - *only included when `include_reasoning=true`*
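With the default `include_reasoning=false`, the payload is much smaller. A minimal sketch, assuming only the always-present fields are emitted (values illustrative):

```json
{
  "flagged": true,
  "confidence": 0.85,
  "threshold": 0.7
}
```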
`docs/ref/checks/jailbreak.md` (8 additions, 2 deletions)
Config example:

```json
{
  "name": "Jailbreak",
  "config": {
    "model": "gpt-4.1-mini",
    "confidence_threshold": 0.7,
    "include_reasoning": false
  }
}
```
Configuration options:

- **`model`** (required): Model to use for detection (e.g., "gpt-4.1-mini")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
    - When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
    - When `true`: Additionally returns detailed reasoning for its decisions
- **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on model) while maintaining detection performance
- **Use Case**: Keep disabled for production to minimize costs and latency; enable for development and debugging
Returns a `GuardrailResult` with the following `info` dictionary (a payload sketch follows the list):

- **`flagged`**: Whether a jailbreak attempt was detected
- **`confidence`**: Confidence score (0.0 to 1.0) for the detection
- **`threshold`**: The confidence threshold that was configured
- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
- **`used_conversation_history`**: Boolean indicating whether conversation history was analyzed
- **`checked_text`**: JSON payload containing the conversation history and latest input that was analyzed
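A sketch of that payload with the default `include_reasoning=false`; values are illustrative, and the `checked_text` content is abbreviated rather than showing its real structure:

```json
{
  "flagged": true,
  "confidence": 0.82,
  "threshold": 0.7,
  "used_conversation_history": true,
  "checked_text": "..."
}
```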
`docs/ref/checks/llm_base.md` (7 additions, 1 deletion)
Config example:

```json
{
  "name": "LLM Base",
  "config": {
    "model": "gpt-5",
    "confidence_threshold": 0.7,
    "include_reasoning": false
  }
}
```
Configuration options:

- **`model`** (required): OpenAI model to use for the check (e.g., "gpt-5")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`); sketches of both output shapes follow this list
    - When `true`: The LLM generates and returns detailed reasoning for its decisions (e.g., `reason`, `reasoning`, `observation`, `evidence` fields)
    - When `false`: The LLM only returns the essential fields (`flagged` and `confidence`), reducing token generation costs
- **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on model) while maintaining detection performance
- **Use Case**: Keep disabled for production to minimize costs and latency; enable for development and debugging
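To make the difference concrete, hedged sketches of the two output shapes; which reasoning fields appear beyond `flagged` and `confidence` depends on the specific check, and the values here are illustrative.

With `include_reasoning=false` (essential fields only):

```json
{ "flagged": false, "confidence": 0.12 }
```

With `include_reasoning=true` (adds check-specific reasoning fields such as `reason`):

```json
{ "flagged": true, "confidence": 0.88, "reason": "The input asks the model to ignore its system instructions." }
```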
`docs/ref/checks/nsfw.md` (6 additions, 0 deletions)
Configuration options:

- **`model`** (required): Model to use for detection (e.g., "gpt-4.1-mini")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
    - When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
    - When `true`: Additionally returns detailed reasoning for its decisions
- **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on model) while maintaining detection performance
- **Use Case**: Keep disabled for production to minimize costs and latency; enable for development and debugging
Returns a `GuardrailResult` with the following `info` dictionary (a payload sketch follows the list):

- **`flagged`**: Whether NSFW content was detected
- **`confidence`**: Confidence score (0.0 to 1.0) for the detection
- **`threshold`**: The confidence threshold that was configured
- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
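A sketch of that payload with reasoning enabled (values illustrative, only the documented fields shown):

```json
{
  "flagged": true,
  "confidence": 0.93,
  "threshold": 0.7,
  "reason": "The output contains explicit sexual content and profanity."
}
```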
`docs/ref/checks/off_topic_prompts.md` (8 additions, 2 deletions)
Configuration options:

- **`model`** (required): Model to use for analysis (e.g., "gpt-5")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`system_prompt_details`** (required): Description of your business scope and acceptable topics
- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`); see the config sketch after this list
    - When `false`: The LLM only generates the essential fields (`flagged` and `confidence`), reducing token generation costs
    - When `true`: Additionally returns detailed reasoning for its decisions
- **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on model) while maintaining detection performance
- **Use Case**: Keep disabled for production to minimize costs and latency; enable for development and debugging
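A hypothetical config sketch; the check name and the `system_prompt_details` text below are illustrative placeholders, not from the source:

```json
{
  "name": "Off Topic Prompts",
  "config": {
    "model": "gpt-5",
    "confidence_threshold": 0.7,
    "system_prompt_details": "Customer support for an online bookstore: orders, shipping, returns, and account questions.",
    "include_reasoning": false
  }
}
```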
Returns a `GuardrailResult` with the following `info` dictionary:

- **`flagged`**: Whether the content is off-topic (outside your business scope)
- **`confidence`**: Confidence score (0.0 to 1.0) for the assessment
- **`threshold`**: The confidence threshold that was configured
- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
`docs/ref/checks/prompt_injection_detection.md` (11 additions, 2 deletions)
Config example:

```json
{
  "name": "Prompt Injection Detection",
  "config": {
    "model": "gpt-4.1-mini",
    "confidence_threshold": 0.7,
    "include_reasoning": false
  }
}
```
Configuration options:

- **`model`** (required): Model to use for prompt injection detection analysis (e.g., "gpt-4.1-mini")
- **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
- **`include_reasoning`** (optional): Whether to include the `observation` and `evidence` fields in the output (default: `false`)
    - When `true`: Returns a detailed `observation` explaining what the action is doing and `evidence` with specific quotes/details
    - When `false`: Omits the reasoning fields to save tokens (typically 100-300 tokens per check)
- **Performance**: In our evaluations, disabling reasoning reduces median latency by 40% on average (ranging from 18% to 67% depending on model) while maintaining detection performance
- **Use Case**: Keep disabled for production to minimize costs and latency; enable for development and debugging
Returns a `GuardrailResult` with the following `info` dictionary (a minimal `include_reasoning=false` sketch follows the list):

- **`observation`**: What the AI action is doing - *only included when `include_reasoning=true`*
- **`flagged`**: Whether the action is misaligned (boolean)
- **`confidence`**: Confidence score (0.0 to 1.0) that the action is misaligned
- **`evidence`**: Specific evidence from the conversation supporting the decision - *only included when `include_reasoning=true`*
- **`threshold`**: The confidence threshold that was configured
- **`user_goal`**: The tracked user intent from the conversation
- **`action`**: The list of function calls or tool outputs analyzed for alignment

**Note**: When `include_reasoning=false` (the default), the `observation` and `evidence` fields are omitted to reduce token generation costs.
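With the default `include_reasoning=false`, a minimal sketch of that payload; values are illustrative and the `action` entries are abbreviated rather than showing real tool-call objects:

```json
{
  "flagged": true,
  "confidence": 0.9,
  "threshold": 0.7,
  "user_goal": "Check the status of order #12345",
  "action": ["..."]
}
```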
0 commit comments