Feature Request: Multimodal Vision-Based Email Phishing Triage + Bounded Browser Inspection

# Feature Request: Multimodal Vision-Based Email Phishing Triage + Bounded Browser Inspection

## The Problem

Current email security tools analyze emails through **metadata, headers, and text pattern matching** (KQL queries, regex rules, IOC lookups). This misses a critical attack vector: emails that are **visually deceptive** — pixel-perfect brand impersonation, fake login forms, urgency cues, and social engineering layouts that fool humans at the visual level.

SOC analysts triage emails **by looking at them**. No existing Security Copilot plugin replicates this visual analysis.

Additionally, when suspicious URLs are found in emails, analysts must manually visit them in sandboxed browsers to determine their purpose (credential harvesting pages, redirectors, download pages). This is time-consuming and risky.

## Proposed Solution: Two Capabilities

### 1. Multimodal QuickLook — Visual Email Triage

**Concept:** Render the email as a screenshot (safe, no remote content loaded), then send it to a multimodal LLM (GPT-4o, Gemini, etc.) alongside email metadata for visual analysis.

**What the LLM sees (just like an analyst would):**
- Brand logo placement and quality
- Layout and formatting anomalies
- Urgency cues (red text, warning icons, countdown language)
- Fake login forms or credential request patterns
- Link display text vs. actual URL mismatches
- Generic greetings vs. personalization

**What it returns:**
- Suspicion score (0.0 - 1.0)
- Classification label (legitimate, suspicious, likely_phishing, credential_harvesting, malware_delivery)
- Visual signals detected ("fake login form", "brand logo mismatch", "urgency language")
- Header signals ("reply-to mismatch", "spoofed sender domain")
- Content signals ("credential request via link", "threat of account compromise")
- Recommended next step (dismiss, monitor, inspect URLs, detonate attachments)

**Key innovation:** The LLM analyzes the email **as rendered** — exactly how a human would see it. This catches visual social engineering that text-only analysis completely misses.

**Safety:** The email is rendered in a headless browser with all remote content blocked (images, fonts, tracking pixels). The LLM receives a static screenshot — no execution, no network calls from the email content.

### 2. QuickBrowse — Bounded Automated URL Inspection

**Concept:** When QuickLook flags suspicious URLs, automatically dispatch a headless browser to visit them with strict safety boundaries, then use the same multimodal LLM to analyze what the browser finds.

**How it works:**
1. Open URL in a headless Chromium browser (isolated context, no persistent state)
2. LLM planner decides next action based on page state + screenshot: extract forms, follow redirects, stop
3. Capture page screenshots, form structures, redirect chains, domains contacted
4. LLM determines page purpose: credential harvesting, malware download, redirect chain, legitimate content
5. All network activity logged (domains, IPs resolved, GeoIP, ASN)

**Safety boundaries:**
- Maximum hop count (no infinite redirect following)
- Maximum action count (no endless exploration)
- Domain-hop restriction (stay within one hop of original domain)
- No credential submission, no form filling
- No file downloads
- Hard timeout on all operations
- Ephemeral browser context (destroyed after each inspection)

**What it produces:**
- Final URL after all redirects
- Forms detected (credential harvesting indicators)
- Domains contacted with IP, GeoIP, and ASN resolution
- Page screenshots at each navigation step
- Process tree of browser navigation (rendered as behavioral analysis)

## Architecture Overview

```
Email (.eml)
    |
    v
Safe Renderer (headless browser, no remote content)
    |
    v
Screenshot + Metadata + Deterministic Features
    |
    v
Multimodal LLM (visual analysis)
    |
    +--> Score < 0.4: Dismiss (legitimate)
    +--> Score 0.4-0.8: QuickBrowse URLs (suspicious)
    +--> Score > 0.8: Full sandbox detonation (high risk)
    |
    v
QuickBrowse (for suspicious URLs)
    |
    v
LLM-guided browser navigation + page analysis
    |
    v
Unified Report (visual signals + network + behavior)
```

## Why This Matters

- **3000+ emails/day** in enterprise environments need automated triage
- **Visual social engineering** is the #1 phishing technique — and it's invisible to text-only analysis
- **URL inspection** currently requires manual analyst work or separate sandbox tools
- Combining **vision + browser automation + LLM reasoning** creates a triage pipeline that thinks like an analyst

## Proof of Concept

We have built and tested this approach in an open-source project. Results on real phishing samples:

| Email | Visual Verdict | Score | Key Signals |
|-------|---------------|-------|-------------|
| Fake account renewal | likely_phishing | 0.80 | generic greeting, urgency, suspicious sender domain |
| Fake transaction completion | likely_phishing | 0.75 | urgency language, credential request |
| Credential harvesting attempt | credential_harvesting | 0.90 | fake login form, brand impersonation, reply-to mismatch |
| Legitimate plain text | legitimate | 0.10 | no signals detected |
| Malware attachment delivery | malware_delivery | 0.85 | suspicious attachment, threat language |

The multimodal LLM correctly identified visual phishing indicators that would be invisible to header/text-only analysis.

## Integration with Security Copilot

This could integrate as:
- A **custom plugin** that calls Azure OpenAI GPT-4o for visual analysis
- A **Logic App** workflow triggered by email ingestion
- An **API-based skill** that SOC analysts invoke during investigations

The approach is model-agnostic — it works with any multimodal LLM that supports vision (GPT-4o, Gemini, Claude, open-weight models).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature Request: Multimodal Vision-Based Email Phishing Triage + Bounded Browser Inspection #220

Feature Request: Multimodal Vision-Based Email Phishing Triage + Bounded Browser Inspection

The Problem

Proposed Solution: Two Capabilities

1. Multimodal QuickLook — Visual Email Triage

2. QuickBrowse — Bounded Automated URL Inspection

Architecture Overview

Why This Matters

Proof of Concept

Integration with Security Copilot

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Email	Visual Verdict	Score	Key Signals
Fake account renewal	likely_phishing	0.80	generic greeting, urgency, suspicious sender domain
Fake transaction completion	likely_phishing	0.75	urgency language, credential request
Credential harvesting attempt	credential_harvesting	0.90	fake login form, brand impersonation, reply-to mismatch
Legitimate plain text	legitimate	0.10	no signals detected
Malware attachment delivery	malware_delivery	0.85	suspicious attachment, threat language

Feature Request: Multimodal Vision-Based Email Phishing Triage + Bounded Browser Inspection #220

Description

Feature Request: Multimodal Vision-Based Email Phishing Triage + Bounded Browser Inspection

The Problem

Proposed Solution: Two Capabilities

1. Multimodal QuickLook — Visual Email Triage

2. QuickBrowse — Bounded Automated URL Inspection

Architecture Overview

Why This Matters

Proof of Concept

Integration with Security Copilot

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions