Feature Request: Multimodal Vision-Based Email Phishing Triage + Bounded Browser Inspection #220

@toonight

The Problem

Current email security tools analyze emails through metadata, headers, and text pattern matching (KQL queries, regex rules, IOC lookups). This misses a critical attack vector: emails that are visually deceptive — pixel-perfect brand impersonation, fake login forms, urgency cues, and social engineering layouts that fool humans at the visual level.

SOC analysts triage emails by looking at them. No existing Security Copilot plugin replicates this visual analysis.

Additionally, when suspicious URLs are found in emails, analysts must manually visit them in sandboxed browsers to determine their purpose (credential harvesting pages, redirectors, download pages). This is time-consuming and risky.

Proposed Solution: Two Capabilities

1. Multimodal QuickLook — Visual Email Triage

Concept: Render the email as a screenshot (safe, no remote content loaded), then send it to a multimodal LLM (GPT-4o, Gemini, etc.) alongside email metadata for visual analysis.

What the LLM sees (just like an analyst would):

  • Brand logo placement and quality
  • Layout and formatting anomalies
  • Urgency cues (red text, warning icons, countdown language)
  • Fake login forms or credential request patterns
  • Link display text vs. actual URL mismatches
  • Generic greetings vs. personalization
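One of these signals, link display text that does not match the actual URL, can also be extracted deterministically and passed to the LLM next to the screenshot. The following is a minimal sketch using only the Python standard library; `LinkCollector` and `find_link_mismatches` are hypothetical names, and the exact-hostname comparison is a simplifying assumption (a real check would likely compare registrable domains).

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects (display_text, href) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def _host(value):
    """Return a lowercased hostname if value looks like a URL, else None."""
    value = value.strip()
    if not value or any(ch.isspace() for ch in value):
        return None  # ordinary link text such as "Click here"
    candidate = value if "://" in value else "http://" + value
    try:
        host = urlparse(candidate).hostname
    except ValueError:
        return None
    return host.lower() if host and "." in host else None


def find_link_mismatches(html):
    """List links whose visible text names a different host than the href."""
    collector = LinkCollector()
    collector.feed(html)
    mismatches = []
    for text, href in collector.links:
        shown, actual = _host(text), _host(href)
        if shown and actual and shown != actual:
            mismatches.append({"display": text, "href": href})
    return mismatches
```

Links whose visible text is not URL-like (for example "Click here") are ignored, since there is no displayed host to compare against.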

What it returns:

  • Suspicion score (0.0 - 1.0)
  • Classification label (legitimate, suspicious, likely_phishing, credential_harvesting, malware_delivery)
  • Visual signals detected ("fake login form", "brand logo mismatch", "urgency language")
  • Header signals ("reply-to mismatch", "spoofed sender domain")
  • Content signals ("credential request via link", "threat of account compromise")
  • Recommended next step (dismiss, monitor, inspect URLs, detonate attachments)
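The return values listed above could be carried in a small typed structure that is validated before anything downstream acts on it. This is only a sketch: the field names, the JSON shape, and the helper `parse_quicklook_result` are assumptions for illustration, not the schema of the actual project.

```python
import json
from dataclasses import dataclass, field

# Labels taken from the list above.
ALLOWED_LABELS = {
    "legitimate",
    "suspicious",
    "likely_phishing",
    "credential_harvesting",
    "malware_delivery",
}


@dataclass
class QuickLookResult:
    score: float                     # suspicion score, 0.0 to 1.0
    label: str                       # one of ALLOWED_LABELS
    visual_signals: list[str] = field(default_factory=list)
    header_signals: list[str] = field(default_factory=list)
    content_signals: list[str] = field(default_factory=list)
    next_step: str = ""              # e.g. "dismiss", "inspect URLs"


def parse_quicklook_result(raw_json: str) -> QuickLookResult:
    """Parse and validate the model's JSON answer (hypothetical shape)."""
    data = json.loads(raw_json)
    score = float(data["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    label = data["label"]
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return QuickLookResult(
        score=score,
        label=label,
        visual_signals=list(data.get("visual_signals", [])),
        header_signals=list(data.get("header_signals", [])),
        content_signals=list(data.get("content_signals", [])),
        next_step=str(data.get("next_step", "")),
    )
```

Rejecting out-of-range scores and unknown labels keeps a malformed model response from silently driving the routing step.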

Key innovation: The LLM analyzes the email as rendered — exactly how a human would see it. This catches visual social engineering that text-only analysis completely misses.

Safety: The email is rendered in a headless browser with all remote content blocked (images, fonts, tracking pixels). The LLM receives a static screenshot — no execution, no network calls from the email content.
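As an illustration of the "all remote content blocked" rule, the decision could be reduced to a predicate that a headless browser's request-interception hook consults for every request. The sketch below assumes that only inline or embedded resources (`data:`, `cid:`, `about:`) are allowed; `should_block_request` and the allow-list are hypothetical and would need adjusting to however the renderer loads the email body.

```python
from urllib.parse import urlparse

# Assumption: only these schemes never touch the network.
LOCAL_SCHEMES = frozenset({"data", "cid", "about"})


def should_block_request(url: str, allowed_schemes=LOCAL_SCHEMES) -> bool:
    """Return True for any request that could leave the rendering sandbox.

    Scheme-relative URLs such as //cdn.example/x.png have an empty scheme
    and are therefore blocked as well.
    """
    scheme = urlparse(url).scheme.lower()
    return scheme not in allowed_schemes
```

Defaulting to "block unless explicitly local" means an unexpected scheme fails closed rather than open.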

2. QuickBrowse — Bounded Automated URL Inspection

Concept: When QuickLook flags suspicious URLs, automatically dispatch a headless browser to visit them with strict safety boundaries, then use the same multimodal LLM to analyze what the browser finds.

How it works:

  1. Open URL in a headless Chromium browser (isolated context, no persistent state)
  2. LLM planner decides next action based on page state + screenshot: extract forms, follow redirects, stop
  3. Capture page screenshots, form structures, redirect chains, domains contacted
  4. LLM determines page purpose: credential harvesting, malware download, redirect chain, legitimate content
  5. All network activity logged (domains, IPs resolved, GeoIP, ASN)

Safety boundaries:

  • Maximum hop count (no infinite redirect following)
  • Maximum action count (no endless exploration)
  • Domain-hop restriction (stay within one hop of original domain)
  • No credential submission, no form filling
  • No file downloads
  • Hard timeout on all operations
  • Ephemeral browser context (destroyed after each inspection)
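These boundaries could be enforced by a small guard that the LLM planner must pass before any browser action runs. The sketch below is one possible shape: `InspectionBudget`, `BoundaryViolation`, the action names, and the default limits are all assumptions, and "one hop of the original domain" is interpreted here as at most one cross-domain transition away from the starting host.

```python
import time
from urllib.parse import urlparse

# Assumption: the planner may only choose among these actions.
# Form filling, credential submission, and downloads are simply absent.
ALLOWED_ACTIONS = {"extract_forms", "follow_redirect", "stop"}


class BoundaryViolation(Exception):
    """Raised when a planned action would cross a safety boundary."""


class InspectionBudget:
    def __init__(self, start_url, max_hops=5, max_actions=10,
                 max_domain_hops=1, timeout_s=30.0):
        self.origin_host = urlparse(start_url).hostname or ""
        self.current_host = self.origin_host
        self.domain_distance = 0   # cross-domain steps away from the origin
        self.max_hops = max_hops
        self.max_actions = max_actions
        self.max_domain_hops = max_domain_hops
        self.hops = 0
        self.actions = 0
        self.deadline = time.monotonic() + timeout_s

    def authorize(self, action, target_url=None):
        """Record the action if it is within bounds, else raise."""
        if time.monotonic() > self.deadline:
            raise BoundaryViolation("hard timeout exceeded")
        if action not in ALLOWED_ACTIONS:
            raise BoundaryViolation(f"action not permitted: {action}")
        if self.actions >= self.max_actions:
            raise BoundaryViolation("maximum action count reached")
        if action == "follow_redirect":
            self._authorize_hop(target_url or "")
        self.actions += 1

    def _authorize_hop(self, target_url):
        if self.hops >= self.max_hops:
            raise BoundaryViolation("maximum hop count reached")
        host = urlparse(target_url).hostname or ""
        if host == self.origin_host:
            distance = 0
        elif host == self.current_host:
            distance = self.domain_distance
        else:
            distance = self.domain_distance + 1
        if distance > self.max_domain_hops:
            raise BoundaryViolation(f"domain-hop limit exceeded: {host}")
        self.hops += 1
        self.current_host = host
        self.domain_distance = distance
```

Because the guard sits outside the model, a planner that proposes a forbidden action such as filling a form is stopped regardless of what the LLM outputs.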

What it produces:

  • Final URL after all redirects
  • Forms detected (credential harvesting indicators)
  • Domains contacted with IP, GeoIP, and ASN resolution
  • Page screenshots at each navigation step
  • Process tree of browser navigation (rendered as behavioral analysis)

Architecture Overview

Email (.eml)
    |
    v
Safe Renderer (headless browser, no remote content)
    |
    v
Screenshot + Metadata + Deterministic Features
    |
    v
Multimodal LLM (visual analysis)
    |
    +--> Score < 0.4: Dismiss (legitimate)
    +--> Score 0.4-0.8: QuickBrowse URLs (suspicious)
    +--> Score > 0.8: Full sandbox detonation (high risk)
    |
    v
QuickBrowse (for suspicious URLs)
    |
    v
LLM-guided browser navigation + page analysis
    |
    v
Unified Report (visual signals + network + behavior)
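The score-based branching in the diagram could be expressed as a small routing function. This is a sketch under the assumption that the 0.4-0.8 band is inclusive at both ends, which the diagram does not state explicitly; `route_by_score` and the returned route names are hypothetical.

```python
def route_by_score(score: float) -> str:
    """Map a QuickLook suspicion score to the next pipeline stage."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score < 0.4:
        return "dismiss"             # treated as legitimate
    if score <= 0.8:
        return "quickbrowse"         # suspicious: inspect URLs
    return "sandbox_detonation"      # high risk
```

For example, `route_by_score(0.75)` returns `"quickbrowse"`, while `route_by_score(0.9)` returns `"sandbox_detonation"`.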

Why This Matters

  • 3000+ emails/day in enterprise environments need automated triage
  • Visual social engineering is the #1 phishing technique, and it's invisible to text-only analysis
  • URL inspection currently requires manual analyst work or separate sandbox tools
  • Combining vision + browser automation + LLM reasoning creates a triage pipeline that thinks like an analyst

Proof of Concept

We have built and tested this approach in an open-source project. Results on real phishing samples:

Email                         | Visual Verdict        | Score | Key Signals
Fake account renewal          | likely_phishing       | 0.80  | generic greeting, urgency, suspicious sender domain
Fake transaction completion   | likely_phishing       | 0.75  | urgency language, credential request
Credential harvesting attempt | credential_harvesting | 0.90  | fake login form, brand impersonation, reply-to mismatch
Legitimate plain text         | legitimate            | 0.10  | no signals detected
Malware attachment delivery   | malware_delivery      | 0.85  | suspicious attachment, threat language

The multimodal LLM correctly identified visual phishing indicators that would be invisible to header/text-only analysis.

Integration with Security Copilot

This could integrate as:

  • A custom plugin that calls Azure OpenAI GPT-4o for visual analysis
  • A Logic App workflow triggered by email ingestion
  • An API-based skill that SOC analysts invoke during investigations

The approach is model-agnostic — it works with any multimodal LLM that supports vision (GPT-4o, Gemini, Claude, open-weight models).
