
Commit 43d9f2b

parameterize LLM returning reasoning (#54)
* parameterize LLM returning reasoning
* Preserve reason field in error fallback message
* Making new param optional
* Fix prompt injection reporting errors
1 parent a8e8ace commit 43d9f2b

19 files changed (+913, -138 lines)

docs/ref/checks/custom_prompt_check.md

Lines changed: 7 additions & 1 deletion
@@ -10,7 +10,8 @@ Implements custom content checks using configurable LLM prompts. Uses your custo
   "config": {
     "model": "gpt-5",
     "confidence_threshold": 0.7,
-    "system_prompt_details": "Determine if the user's request needs to be escalated to a senior support agent. Indications of escalation include: ..."
+    "system_prompt_details": "Determine if the user's request needs to be escalated to a senior support agent. Indications of escalation include: ...",
+    "include_reasoning": false
   }
 }
 ```
@@ -20,6 +21,10 @@ Implements custom content checks using configurable LLM prompts. Uses your custo
 - **`model`** (required): Model to use for the check (e.g., "gpt-5")
 - **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
 - **`system_prompt_details`** (required): Custom instructions defining the content detection criteria
+- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
+    - When `false`: the LLM generates only the essential fields (`flagged` and `confidence`), reducing token generation costs
+    - When `true`: additionally returns detailed reasoning for its decisions
+    - **Use case**: keep disabled in production to minimize costs; enable for development and debugging
 
 ## Implementation Notes
 
@@ -42,3 +47,4 @@ Returns a `GuardrailResult` with the following `info` dictionary:
 - **`flagged`**: Whether the custom validation criteria were met
 - **`confidence`**: Confidence score (0.0 to 1.0) for the validation
 - **`threshold`**: The confidence threshold that was configured
+- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
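
A minimal TypeScript sketch of how a caller might consume the new flag. `runGuardrail` is a hypothetical stand-in for whatever entry point your integration uses; the config keys and `info` fields mirror the documentation above:

```typescript
// Sketch only: `runGuardrail` is hypothetical; config keys and `info`
// fields come from the custom_prompt_check docs above.
interface CustomPromptCheckConfig {
  model: string;
  confidence_threshold: number;
  system_prompt_details: string;
  include_reasoning?: boolean; // optional; defaults to false
}

interface CustomPromptCheckInfo {
  flagged: boolean;
  confidence: number;
  threshold: number;
  reason?: string; // present only when include_reasoning is true
}

declare function runGuardrail(
  config: CustomPromptCheckConfig,
  text: string,
): Promise<CustomPromptCheckInfo>;

const config: CustomPromptCheckConfig = {
  model: "gpt-5",
  confidence_threshold: 0.7,
  system_prompt_details:
    "Determine if the user's request needs to be escalated to a senior support agent.",
  include_reasoning: true, // enabled here for debugging; omit in production
};

async function checkEscalation(text: string): Promise<void> {
  const info = await runGuardrail(config, text);
  if (info.flagged && info.confidence >= info.threshold) {
    // `reason` is undefined unless include_reasoning was enabled.
    console.warn(`Flagged: ${info.reason ?? "(no reasoning requested)"}`);
  }
}
```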

docs/ref/checks/hallucination_detection.md

Lines changed: 15 additions & 8 deletions
@@ -14,7 +14,8 @@ Flags model text containing factual claims that are clearly contradicted or not
   "config": {
     "model": "gpt-4.1-mini",
     "confidence_threshold": 0.7,
-    "knowledge_source": "vs_abc123"
+    "knowledge_source": "vs_abc123",
+    "include_reasoning": false
   }
 }
 ```
@@ -24,6 +25,10 @@ Flags model text containing factual claims that are clearly contradicted or not
 - **`model`** (required): OpenAI model to use for validation (e.g., "gpt-4.1-mini")
 - **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
 - **`knowledge_source`** (required): OpenAI vector store ID starting with "vs_" containing reference documents
+- **`include_reasoning`** (optional): Whether to include detailed reasoning fields in the output (default: `false`)
+    - When `false`: returns only `flagged` and `confidence` to save tokens
+    - When `true`: additionally returns `reasoning`, `hallucination_type`, `hallucinated_statements`, and `verified_statements`
+    - Recommended: keep disabled for production (default); enable for development/debugging
 
 ### Tuning guidance
 
@@ -103,7 +108,9 @@ See [`examples/`](https://github.com/openai/openai-guardrails-js/tree/main/examp
 
 ## What It Returns
 
-Returns a `GuardrailResult` with the following `info` dictionary:
+Returns a `GuardrailResult` with the following `info` dictionary.
+
+**With `include_reasoning=true`:**
 
 ```json
 {
@@ -118,15 +125,15 @@ Returns a `GuardrailResult` with the following `info` dictionary:
 }
 ```
 
+### Fields
+
 - **`flagged`**: Whether the content was flagged as potentially hallucinated
 - **`confidence`**: Confidence score (0.0 to 1.0) for the detection
-- **`reasoning`**: Explanation of why the content was flagged
-- **`hallucination_type`**: Type of issue detected (e.g., "factual_error", "unsupported_claim")
-- **`hallucinated_statements`**: Specific statements that are contradicted or unsupported
-- **`verified_statements`**: Statements that are supported by your documents
 - **`threshold`**: The confidence threshold that was configured
-
-Tip: `hallucination_type` is typically one of `factual_error`, `unsupported_claim`, or `none`.
+- **`reasoning`**: Explanation of why the content was flagged - *only included when `include_reasoning=true`*
+- **`hallucination_type`**: Type of issue detected (e.g., "factual_error", "unsupported_claim", "none") - *only included when `include_reasoning=true`*
+- **`hallucinated_statements`**: Specific statements that are contradicted or unsupported - *only included when `include_reasoning=true`*
+- **`verified_statements`**: Statements that are supported by your documents - *only included when `include_reasoning=true`*
 
 ## Benchmark Results
 
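
The field list above implies two payload shapes. A sketch in TypeScript, where the field names come from the documentation but the concrete types (e.g., `string[]`) are assumptions:

```typescript
// Fields always present in the `info` dictionary.
interface HallucinationInfoBase {
  flagged: boolean;   // claim contradicted or unsupported by the vector store
  confidence: number; // 0.0 to 1.0
  threshold: number;  // the configured confidence_threshold
}

// Extra fields emitted only when include_reasoning is true.
interface HallucinationInfoWithReasoning extends HallucinationInfoBase {
  reasoning: string;
  hallucination_type: "factual_error" | "unsupported_claim" | "none";
  hallucinated_statements: string[]; // statements contradicted or unsupported
  verified_statements: string[];     // statements supported by your documents
}
```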
docs/ref/checks/jailbreak.md

Lines changed: 7 additions & 2 deletions
@@ -33,7 +33,8 @@ Detects attempts to bypass safety or policy constraints via manipulation (prompt
   "name": "Jailbreak",
   "config": {
     "model": "gpt-4.1-mini",
-    "confidence_threshold": 0.7
+    "confidence_threshold": 0.7,
+    "include_reasoning": false
   }
 }
 ```
@@ -42,6 +43,10 @@ Detects attempts to bypass safety or policy constraints via manipulation (prompt
 
 - **`model`** (required): Model to use for detection (e.g., "gpt-4.1-mini")
 - **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
+- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
+    - When `false`: the LLM generates only the essential fields (`flagged` and `confidence`), reducing token generation costs
+    - When `true`: additionally returns detailed reasoning for its decisions
+    - **Use case**: keep disabled in production to minimize costs; enable for development and debugging
 
 ### Tuning guidance
 
@@ -68,7 +73,7 @@ Returns a `GuardrailResult` with the following `info` dictionary:
 - **`flagged`**: Whether a jailbreak attempt was detected
 - **`confidence`**: Confidence score (0.0 to 1.0) for the detection
 - **`threshold`**: The confidence threshold that was configured
-- **`reason`**: Natural language rationale describing why the request was (or was not) flagged
+- **`reason`**: Natural language rationale describing why the request was (or was not) flagged - *only included when `include_reasoning=true`*
 - **`used_conversation_history`**: Indicates whether prior conversation turns were included
 - **`checked_text`**: JSON payload containing the conversation slice and latest input analyzed
 
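
One plausible reading of the documented tripwire semantics, as a sketch rather than the library's code: the check trips when the model flags the input and its confidence clears `confidence_threshold` (whether the comparison is `>=` or strict `>` is an assumption):

```typescript
// Illustrative only: flagged results below the threshold do not trip.
function tripwireTriggered(
  flagged: boolean,
  confidence: number,
  confidenceThreshold: number,
): boolean {
  return flagged && confidence >= confidenceThreshold;
}

console.log(tripwireTriggered(true, 0.82, 0.7)); // true
console.log(tripwireTriggered(true, 0.55, 0.7)); // false: confidence too low
```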
docs/ref/checks/llm_base.md

Lines changed: 6 additions & 1 deletion
@@ -11,7 +11,8 @@ Base configuration for LLM-based guardrails. Provides common configuration optio
   "name": "NSFW Text", // or "Jailbreak", "Hallucination Detection", etc.
   "config": {
     "model": "gpt-5",
-    "confidence_threshold": 0.7
+    "confidence_threshold": 0.7,
+    "include_reasoning": false
   }
 }
 ```
@@ -20,6 +21,10 @@ Base configuration for LLM-based guardrails. Provides common configuration optio
 
 - **`model`** (required): OpenAI model to use for the check (e.g., "gpt-5")
 - **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
+- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
+    - When `false`: the LLM generates only the essential fields (`flagged` and `confidence`), reducing token generation costs
+    - When `true`: additionally returns detailed reasoning for its decisions
+    - **Use case**: keep disabled in production to minimize costs; enable for development and debugging
 
 ## What It Does
 
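
What "parameterize LLM returning reasoning" plausibly means mechanically: when `include_reasoning` is false, the model is asked for a smaller structured output, so it never generates reasoning tokens at all. A hedged sketch of that idea, an assumption about the mechanism rather than the library's actual internals:

```typescript
// Build a JSON-schema-style output spec whose size depends on the flag.
function buildOutputSchema(includeReasoning: boolean) {
  const properties: Record<string, object> = {
    flagged: { type: "boolean" },
    confidence: { type: "number", minimum: 0, maximum: 1 },
  };
  if (includeReasoning) {
    properties.reason = { type: "string" }; // extra field => extra output tokens
  }
  return {
    type: "object",
    properties,
    required: Object.keys(properties),
    additionalProperties: false,
  };
}

// buildOutputSchema(false) yields a schema with only the two essential
// fields, which is what the default configuration requests.
```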
docs/ref/checks/nsfw.md

Lines changed: 7 additions & 1 deletion
@@ -20,7 +20,8 @@ Flags workplace‑inappropriate model outputs: explicit sexual content, profanit
   "name": "NSFW Text",
   "config": {
     "model": "gpt-4.1-mini",
-    "confidence_threshold": 0.7
+    "confidence_threshold": 0.7,
+    "include_reasoning": false
   }
 }
 ```
@@ -29,6 +30,10 @@ Flags workplace‑inappropriate model outputs: explicit sexual content, profanit
 
 - **`model`** (required): Model to use for detection (e.g., "gpt-4.1-mini")
 - **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
+- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
+    - When `false`: the LLM generates only the essential fields (`flagged` and `confidence`), reducing token generation costs
+    - When `true`: additionally returns detailed reasoning for its decisions
+    - **Use case**: keep disabled in production to minimize costs; enable for development and debugging
 
 ### Tuning guidance
 
@@ -51,6 +56,7 @@ Returns a `GuardrailResult` with the following `info` dictionary:
 - **`flagged`**: Whether NSFW content was detected
 - **`confidence`**: Confidence score (0.0 to 1.0) for the detection
 - **`threshold`**: The confidence threshold that was configured
+- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
 
 ### Examples
 
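
The production/development recommendation above, spelled out as two concrete configs; a sketch using the same `{ name, config }` shape shown throughout these pages:

```typescript
// Production: default include_reasoning=false omits `reason`, saving tokens.
const nsfwProduction = {
  name: "NSFW Text",
  config: {
    model: "gpt-4.1-mini",
    confidence_threshold: 0.7,
    include_reasoning: false,
  },
};

// Development: include_reasoning=true emits `reason` for debugging flagged outputs.
const nsfwDevelopment = {
  name: "NSFW Text",
  config: {
    model: "gpt-4.1-mini",
    confidence_threshold: 0.7,
    include_reasoning: true,
  },
};
```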
docs/ref/checks/off_topic_prompts.md

Lines changed: 9 additions & 4 deletions
@@ -10,7 +10,8 @@ Ensures content stays within defined business scope using LLM analysis. Flags co
   "config": {
     "model": "gpt-5",
     "confidence_threshold": 0.7,
-    "system_prompt_details": "Customer support for our e-commerce platform. Topics include order status, returns, shipping, and product questions."
+    "system_prompt_details": "Customer support for our e-commerce platform. Topics include order status, returns, shipping, and product questions.",
+    "include_reasoning": false
   }
 }
 ```
@@ -20,6 +21,10 @@ Ensures content stays within defined business scope using LLM analysis. Flags co
 - **`model`** (required): Model to use for analysis (e.g., "gpt-5")
 - **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
 - **`system_prompt_details`** (required): Description of your business scope and acceptable topics
+- **`include_reasoning`** (optional): Whether to include reasoning/explanation fields in the guardrail output (default: `false`)
+    - When `false`: the LLM generates only the essential fields (`flagged` and `confidence`), reducing token generation costs
+    - When `true`: additionally returns detailed reasoning for its decisions
+    - **Use case**: keep disabled in production to minimize costs; enable for development and debugging
 
 ## Implementation Notes
 
@@ -40,7 +45,7 @@ Returns a `GuardrailResult` with the following `info` dictionary:
 }
 ```
 
-- **`flagged`**: Whether the content aligns with your business scope
-- **`confidence`**: Confidence score (0.0 to 1.0) for the prompt injection detection assessment
+- **`flagged`**: Whether the content is off-topic (outside your business scope)
+- **`confidence`**: Confidence score (0.0 to 1.0) for the assessment
 - **`threshold`**: The confidence threshold that was configured
-- **`business_scope`**: Copy of the scope provided in configuration
+- **`reason`**: Explanation of why the input was flagged (or not flagged) - *only included when `include_reasoning=true`*
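
This hunk also corrects the documented meaning of `flagged`: it is true when content falls *outside* the business scope, not when it aligns. A small sketch of a caller relying on the corrected semantics; the routing values are illustrative:

```typescript
interface OffTopicInfo {
  flagged: boolean;   // true when the content is off-topic for the scope
  confidence: number; // 0.0 to 1.0
  threshold: number;
  reason?: string;    // only present when include_reasoning=true
}

// Redirect off-topic requests; let in-scope ones continue unchanged.
function routeMessage(info: OffTopicInfo): "redirect" | "continue" {
  return info.flagged && info.confidence >= info.threshold
    ? "redirect"
    : "continue";
}
```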

docs/ref/checks/prompt_injection_detection.md

Lines changed: 8 additions & 3 deletions
@@ -31,7 +31,8 @@ After tool execution, the prompt injection detection check validates that the re
   "name": "Prompt Injection Detection",
   "config": {
     "model": "gpt-4.1-mini",
-    "confidence_threshold": 0.7
+    "confidence_threshold": 0.7,
+    "include_reasoning": false
   }
 }
 ```
@@ -40,6 +41,10 @@ After tool execution, the prompt injection detection check validates that the re
 
 - **`model`** (required): Model to use for prompt injection detection analysis (e.g., "gpt-4.1-mini")
 - **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
+- **`include_reasoning`** (optional): Whether to include detailed reasoning fields (`observation` and `evidence`) in the output (default: `false`)
+    - When `false`: returns only `flagged` and `confidence` to save tokens
+    - When `true`: additionally returns the `observation` and `evidence` fields
+    - Recommended: keep disabled for production (default); enable for development/debugging
 
 **Flags as MISALIGNED:**
 
@@ -85,15 +90,15 @@ Returns a `GuardrailResult` with the following `info` dictionary:
 }
 ```
 
-- **`observation`**: What the AI action is doing
 - **`flagged`**: Whether the action is misaligned (boolean)
 - **`confidence`**: Confidence score (0.0 to 1.0) that the action is misaligned
-- **`evidence`**: Specific evidence from conversation history that supports the decision (null when aligned)
 - **`threshold`**: The confidence threshold that was configured
 - **`user_goal`**: The tracked user intent from conversation
 - **`action`**: The list of function calls or tool outputs analyzed for alignment
 - **`recent_messages`**: Most recent conversation slice evaluated during the check
 - **`recent_messages_json`**: JSON-serialized snapshot of the recent conversation slice
+- **`observation`**: What the AI action is doing - *only included when `include_reasoning=true`*
+- **`evidence`**: Specific evidence from conversation history that supports the decision (null when aligned) - *only included when `include_reasoning=true`*
 
 ## Benchmark Results
 
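
Since `observation` and `evidence` are emitted only when `include_reasoning=true`, downstream consumers should treat them as optional. A sketch: the field names follow the list above, while the concrete types are assumptions:

```typescript
interface PromptInjectionInfo {
  flagged: boolean;
  confidence: number;
  threshold: number;
  user_goal: string;
  observation?: string;     // only with include_reasoning=true
  evidence?: string | null; // null when aligned; absent without reasoning
}

function describeDecision(info: PromptInjectionInfo): string {
  const verdict = info.flagged ? "MISALIGNED" : "aligned";
  // Fall back gracefully when reasoning fields were not requested.
  const detail = info.observation ?? "(no observation: include_reasoning=false)";
  return `${verdict} (confidence ${info.confidence}): ${detail}`;
}
```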