docs/ref/checks/prompt_injection_detection.md
+8 -3 (8 additions & 3 deletions)
@@ -31,7 +31,8 @@ After tool execution, the prompt injection detection check validates that the re
     "name": "Prompt Injection Detection",
     "config": {
         "model": "gpt-4.1-mini",
-        "confidence_threshold": 0.7
+        "confidence_threshold": 0.7,
+        "include_reasoning": false
     }
 }
 ```
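A minimal sketch of how the new `include_reasoning` flag might be toggled per environment, assuming the guardrail entry is built in Python before being serialized. The helper name `build_prompt_injection_config` and the `GUARDRAILS_DEBUG` environment variable are hypothetical and not part of the library; only the JSON keys come from the diff above.

```python
import json
import os


def build_prompt_injection_config(debug: bool) -> dict:
    """Return a guardrail entry shaped like the JSON in the diff above.

    `debug` is a hypothetical switch; only the keys come from the docs.
    """
    return {
        "name": "Prompt Injection Detection",
        "config": {
            "model": "gpt-4.1-mini",
            "confidence_threshold": 0.7,
            # Documented default is false (saves tokens); enable for debugging.
            "include_reasoning": debug,
        },
    }


if __name__ == "__main__":
    # GUARDRAILS_DEBUG is an assumed variable name used only for illustration.
    debug = os.environ.get("GUARDRAILS_DEBUG", "0") == "1"
    print(json.dumps(build_prompt_injection_config(debug), indent=2))
```

Keeping the flag behind a single switch matches the recommendation in this change: disabled in production, enabled for development and debugging.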
@@ -40,6 +41,10 @@ After tool execution, the prompt injection detection check validates that the re
 - **`model`** (required): Model to use for prompt injection detection analysis (e.g., "gpt-4.1-mini")
 - **`confidence_threshold`** (required): Minimum confidence score to trigger tripwire (0.0 to 1.0)
+- **`include_reasoning`** (optional): Whether to include detailed reasoning fields (`observation` and `evidence`) in the output (default: `false`)
+    - When `false`: Returns only `flagged` and `confidence` to save tokens
+    - When `true`: Additionally, returns `observation` and `evidence` fields
+    - Recommended: Keep disabled for production (default); enable for development/debugging

 **Flags as MISALIGNED:**
@@ -85,15 +90,15 @@ Returns a `GuardrailResult` with the following `info` dictionary:
 }
 ```

-- **`observation`**: What the AI action is doing
 - **`flagged`**: Whether the action is misaligned (boolean)
 - **`confidence`**: Confidence score (0.0 to 1.0) that the action is misaligned
-- **`evidence`**: Specific evidence from conversation history that supports the decision (null when aligned)
 - **`threshold`**: The confidence threshold that was configured
 - **`user_goal`**: The tracked user intent from conversation
 - **`action`**: The list of function calls or tool outputs analyzed for alignment
 - **`recent_messages`**: Most recent conversation slice evaluated during the check
 - **`recent_messages_json`**: JSON-serialized snapshot of the recent conversation slice
+- **`observation`**: What the AI action is doing - *only included when `include_reasoning=true`*
+- **`evidence`**: Specific evidence from conversation history that supports the decision (null when aligned) - *only included when `include_reasoning=true`*
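Because `observation` and `evidence` are now present only when `include_reasoning=true`, code that reads the `info` dictionary should probably treat them as optional. The sketch below assumes `info` is a plain mapping with the keys listed above; `summarize_info` is a hypothetical helper, not a function provided by the library.

```python
from typing import Any, Mapping


def summarize_info(info: Mapping[str, Any]) -> str:
    """Build a one-line summary of a prompt injection check's `info` dict."""
    # These keys are documented as always present.
    flagged = info["flagged"]
    confidence = info["confidence"]
    threshold = info["threshold"]
    summary = (
        f"flagged={flagged} confidence={confidence:.2f} threshold={threshold:.2f}"
    )

    # Reasoning fields are absent unless include_reasoning is enabled, and
    # `evidence` may be null (None) when the action is aligned.
    observation = info.get("observation")
    evidence = info.get("evidence")
    if observation is not None:
        summary += f" | observation: {observation}"
    if evidence is not None:
        summary += f" | evidence: {evidence}"
    return summary
```

Using `.get()` rather than direct indexing keeps callers working with both settings of the flag, which matters for consumers written before this change that read `info["observation"]` unconditionally.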