
Local LLMs Configuration

This document describes the local LLM setup for use with Open Code and other AI development tools.

Table of Contents

  • Open Code Configuration
  • Custom Model Creation
  • Ollama Commands Reference
  • Model Selection Guidelines
  • Updating Configuration
  • Troubleshooting
  • Resources

Open Code Configuration

Open Code is configured via opencode.json in the repository root. This file defines available LLM providers and models.

Current Configuration

Provider: Ollama (local)

  • Base URL: http://localhost:11434/v1
  • NPM Package: @ai-sdk/openai-compatible
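
Because the provider uses Ollama's OpenAI-compatible API, any OpenAI-style client can talk to this endpoint directly. A minimal request to verify the wiring (using qwen3:8b as an example model):

# Send a minimal chat completion request to the local endpoint
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3:8b", "messages": [{"role": "user", "content": "Say hello"}]}'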

Available Models

Model                                    Size      Description
qwen3:8b-16k                             5.2 GB    Qwen3 8B with extended 16k context window (custom)
mistral-nemo:12b-instruct-2407-q4_K_M    7.5 GB    Mistral Nemo 12B Instruct (quantized)
qwen3:8b                                 5.2 GB    Qwen3 8B standard model
granite3.1-moe:latest                    2.0 GB    Granite 3.1 Mixture of Experts
qwen3:4b                                 2.5 GB    Qwen3 4B compact model

Custom Model Creation

Creating Qwen3 8B with Extended Context (16k)

The qwen3:8b-16k model is a custom variant created from the base qwen3:8b model with an extended context window.

Process:

# Start interactive session with base model
ollama run qwen3:8b

# Set extended context parameter (default is typically 8192)
>>> /set parameter num_ctx 16384
Set parameter 'num_ctx' to '16384'

# Save as new model variant
>>> /save qwen3:8b-16k
Created new model 'qwen3:8b-16k'

# Exit session
>>> /bye
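
The same variant can also be created non-interactively with a Modelfile, which is easier to script (FROM and PARAMETER are standard Modelfile directives):

# Alternative: build the same 16k variant from a Modelfile
cat > Modelfile <<'EOF'
FROM qwen3:8b
PARAMETER num_ctx 16384
EOF
ollama create qwen3:8b-16k -f Modelfile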

Verification:

ollama list
# Output:
# NAME                                     ID              SIZE      MODIFIED
# qwen3:8b-16k                             7ef4ca800d20    5.2 GB    9 seconds ago
# mistral-nemo:12b-instruct-2407-q4_K_M    daf673741712    7.5 GB    14 hours ago
# qwen3:8b                                 500a1f067a9f    5.2 GB    18 hours ago
# granite3.1-moe:latest                    b43d80d7fca7    2.0 GB    20 hours ago
# qwen3:4b                                 e55aed6fe643    2.5 GB    2 months ago

Benefits of Extended Context

  • 16k tokens vs standard 8k allows for larger code files and more context
  • Useful for analyzing entire application architectures
  • Better for reviewing multiple related files in a single prompt
  • No increase in model size (same 5.2 GB as base model)
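
The unchanged on-disk size is only half the story: runtime memory grows with num_ctx because of the KV cache (see the performance notes below). While a model is loaded, ollama ps reports its actual footprint:

# Load the model with a one-off prompt, then inspect its runtime footprint
ollama run qwen3:8b-16k "hello" >/dev/null
ollama ps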

Context Window Reality Check

Understanding what different context windows can actually hold:

Context Window   Approximate Words   Approximate Code        Typical Use Cases
4k tokens        ~3,000 words        1 medium file           Single component analysis, quick edits
8k tokens        ~6,000 words        1-2 medium files        Standard file operations, code reviews
16k tokens       ~12,000 words       3-5 medium files        Multi-file analysis, related components
32k tokens       ~24,000 words       6-10 medium files       Feature analysis across files
200k tokens      ~150,000 words      Small-medium codebase   Whole codebase analysis, architecture review

Important notes:

  • Tokens ≠ Words: 1 token ≈ 0.75 words in English (varies by language and content type)
  • Code uses more tokens: Technical terms, symbols, and formatting consume more tokens than natural language
  • "Medium file": ~500-1000 lines of code (~20-40 KB)
  • Context includes prompts: System prompts, tool descriptions, and conversation history all consume tokens
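
A rough way to estimate whether a file fits a given window, using the ≈0.75 words-per-token rule of thumb above (a heuristic only; actual counts depend on the model's tokenizer):

# Estimate tokens as words / 0.75 (heuristic, not a real tokenizer count)
wc -w resources/js/app.js | awk '{printf "~%d tokens\n", $1 / 0.75}'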

Examples with this codebase:

# 8k context (qwen3:8b)
# Can analyze:
- Single theme file (e.g., resources/js/app.js)
- One block component
- README.md + quick overview

# 16k context (qwen3:8b-16k)
# Can analyze:
- Multiple related blocks (e.g., all files in resources/js/blocks/hero/)
- Theme configuration files (tailwind.config.js + theme.json + vite.config.js)
- Related components (index.php + style.css + script.js)

# 200k context (Claude Sonnet 4)
# Can analyze:
- Entire Nynaeve theme structure
- All blocks + configuration + documentation
- Cross-theme patterns (Nynaeve + Moiraine)
- Full project architecture (Trellis + Bedrock + themes)

Practical tip: If your prompt or file content gets cut off, your context window is too small for the task.

Ollama Commands Reference

List Models

ollama list

Run Model

ollama run <model-name>

Pull New Model

ollama pull <model-name>

Remove Model

ollama rm <model-name>

Interactive Commands (within ollama run)

  • /set parameter <name> <value> - Modify model parameters
  • /save <new-model-name> - Save current configuration as new model
  • /show - Display model information
  • /bye - Exit interactive session
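
For example, to confirm outside an interactive session that a saved parameter stuck (the exact output layout may vary by Ollama version):

# Inspect a saved model's configuration
ollama show qwen3:8b-16k
# The Parameters section should list: num_ctx 16384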

Model Selection Guidelines

For Code Generation:

  • Mistral Nemo 12B: Best overall code quality, instruction following
  • Qwen3 8B: Fast, good balance of quality and speed

For Quick Tasks:

  • Qwen3 4B: Fastest responses, good for simple tasks
  • Granite 3.1 MoE: Efficient for varied tasks

For Large Context:

  • Qwen3 8B-16k: When analyzing multiple files or large codebases

Updating Configuration

When adding or removing Ollama models, update opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Ollama (local)",
      "options": {
        "baseURL": "http://localhost:11434/v1"
      },
      "models": {
        "model-name": {
          "name": "Display Name"
        }
      }
    }
  }
}
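
After editing, a quick sanity check that the file is still valid JSON can save a confusing startup failure (assuming jq is installed):

# Validate opencode.json before restarting Open Code
jq . opencode.json > /dev/null && echo "opencode.json is valid JSON"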

Troubleshooting

Ollama Not Running

# Check if Ollama is running
curl http://localhost:11434/v1/models

# Start Ollama (macOS)
ollama serve

Model Not Found

# Verify model exists
ollama list

# Pull model if missing
ollama pull <model-name>

Out of Memory

  • Use smaller models (Qwen3 4B, Granite 3.1 MoE)
  • Close other applications
  • Reduce context window with /set parameter num_ctx <value>

Slow Performance with Local Models

Problem: Open Code CLI runs very slowly with Ollama models, even for simple tasks.

Symptoms:

  • Simple file creation takes 10-30 seconds
  • Significant delay before seeing any output
  • Model generates verbose thinking output before action
  • Overall slower than Claude Code with cloud models

Causes:

  1. Local model inference speed

    • CPU/GPU limitations on local machine
    • Model size vs. available VRAM
    • Quantization level (Q4_K_M is faster than higher precision)
  2. Extended context overhead

    • 16k context models (like qwen3:8b-16k) use more memory
    • KV cache: 2.2 GiB (16k) vs 576 MiB (8k)
    • Slower token generation with larger context
  3. Open Code CLI overhead

    • Multiple tool calls for simple operations
    • Verbose system prompts
    • Thinking mode adds extra tokens to generate

Performance comparison (simple file write task):

Model                     Context   Time      Notes
qwen3:8b-16k              16k       10-30s    Extended context, slower
qwen3:8b                  8k        8-20s     Standard context
qwen3:4b                  8k        5-15s     Smaller, faster
granite3.1-moe            8k        6-18s     Efficient MoE architecture
Claude Sonnet 4 (cloud)   200k      2-5s      API-based, much faster

Solutions:

  1. Use smaller models for simple tasks:

    # For quick file operations
    opencode run "create file" --model ollama/qwen3:4b
    
    # For analyzing large files or multiple related files
    opencode run "analyze components in resources/js/blocks/" --model ollama/qwen3:8b-16k
  2. Use standard context when extended context isn't needed:

    # 8k context is sufficient for most single-file operations
    opencode run "update file" --model ollama/qwen3:8b
  3. Use cloud models for whole-codebase analysis:

    # Claude Sonnet 4 has 200k context - much better for codebase-wide tasks
    # Use Claude Code for tasks like:
    # - "Analyze the entire authentication flow"
    # - "Review all API endpoints"
    # - "Find all instances of X pattern across the codebase"
  4. Use Claude Code for interactive development:

    • Claude Code with API access is significantly faster
    • Better for real-time coding assistance
    • Open Code CLI with local models is better for:
      • Offline work
      • Privacy-sensitive codebases
      • Batch operations where speed is less critical
  5. Optimize Ollama settings:

    # In a Modelfile (apply with: ollama create <name> -f Modelfile)
    FROM qwen3:8b
    
    # Reduce batch size for a faster first token
    PARAMETER num_batch 256  # Default is 512
    
    # Reduce the context window
    PARAMETER num_ctx 4096   # Smaller context = faster generation
  6. Hardware considerations:

    • Use GPU if available (Metal on macOS, CUDA on Linux/Windows)
    • Close memory-intensive applications
    • Monitor ollama serve logs for performance bottlenecks

When to use local vs. cloud models:

Use local models (Ollama) when:

  • Working offline
  • Processing sensitive/proprietary code
  • Running batch operations overnight
  • Learning/experimenting without API costs
  • Privacy requirements mandate local processing

Use cloud models (Claude API) when:

  • Real-time interactive development
  • Complex multi-file operations requiring fast iteration
  • Time-sensitive tasks
  • Working with very large codebases (200k+ context)
  • Speed is more important than cost

Note: Local model performance will improve with better hardware (more VRAM, faster GPU/CPU) and future Ollama optimizations.

Open Code CLI "Cannot Read Binary File" Error

Problem: When running Open Code CLI's /init command, you may encounter an error like:

Cannot read binary file: /Users/jasperfrumau/code/imagewize.com/AGENTS.md

Cause: This occurs when documentation files (like AGENTS.md) contain Unicode box-drawing characters used for tree structure visualization. These characters (such as ├, │, and └) are multi-byte UTF-8 sequences that some tools misinterpret as binary data.

Verification: Check if your file is detected as binary:

file AGENTS.md
# Bad: "AGENTS.md: data" (binary)
# Good: "AGENTS.md: ASCII text" or "AGENTS.md: UTF-8 Unicode text"

Solution 1: Remove all non-printable characters (Quick fix):

# Strip all non-ASCII printable characters
LC_ALL=C tr -cd '\11\12\15\40-\176' < AGENTS.md > AGENTS_clean.md
mv AGENTS_clean.md AGENTS.md

# Verify it's now recognized as text
file AGENTS.md
# Should show: "AGENTS.md: ASCII text"
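
To sweep every Markdown file in the repository with the same file heuristic (a small sketch; adjust the path pattern as needed):

# Flag any .md file that `file` does not classify as text
find . -name '*.md' -exec sh -c 'file "$1" | grep -qv text && echo "NON-TEXT: $1"' _ {} \;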

Solution 2: Manual replacement - Replace Unicode tree characters with ASCII:

Instead of:

├── site/
│   ├── web/
│   │   └── app/

Use standard text formatting:

- site/
  - web/
    - app/

Or use spaces for indentation:

imagewize.com/
  site/          # Main site
  demo/          # Demo site
  trellis/       # Server provisioning

Why this happens:

  • Tools such as the tree command emit Unicode box-drawing characters (U+2500 to U+257F range)
  • These are valid UTF-8 but contain byte values outside the ASCII range
  • The file command and similar tools may classify files with these characters as "data" (binary)
  • Open Code CLI refuses to read files detected as binary for safety reasons

Prevention:

  • Don't copy/paste output from tree command into documentation
  • Use ASCII-only characters (-, *, spaces) for structure diagrams
  • Verify files with file command before committing
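
One way to automate the last point is a pre-commit hook (a sketch; save as .git/hooks/pre-commit and make it executable):

#!/bin/sh
# Reject staged .md files that `file` flags as non-text
for f in $(git diff --cached --name-only --diff-filter=AM -- '*.md'); do
  file "$f" | grep -q text || { echo "Non-text content in $f"; exit 1; }
done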

Note: The AGENTS.md file in this repository has been cleaned of all non-printable characters.

Open Code CLI /init Mode Issues

Problem 1: Agent starts in "thinking mode" instead of "build mode"

When running /init, the agent may start in thinking/planning mode even though the default Open Code mode should be "build". This causes the agent to spend excessive time planning instead of taking action.

Symptoms:

  • Agent displays <think> tags and lengthy analysis
  • Takes 2-3 minutes just to analyze the codebase before acting
  • Generates verbose planning output before executing tasks

Cause: The /init command may trigger plan agent behavior, or the agent interprets the initialization request as a planning task.

Solutions:

  1. Skip /init entirely - If you have an existing AGENTS.md file:

    # Just start Open Code CLI without /init
    # It will automatically read AGENTS.md if present
  2. Switch to build agent - After /init completes, press Tab to ensure build agent is active:

    # Press Tab key to switch between plan and build agents
    # Build agent is default for file creation tasks
  3. Create AGENTS.md manually - Instead of using /init, create or update AGENTS.md yourself:

    • Copy relevant sections from CLAUDE.md
    • Add Open Code-specific configuration
    • Format for AI agent consumption (no Unicode characters)
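
For option 3, a minimal starting point is to derive AGENTS.md from CLAUDE.md while stripping non-ASCII characters in the same pass (reusing the tr cleanup from the binary-file fix above):

# Copy CLAUDE.md to AGENTS.md, keeping only printable ASCII plus tab/newline/CR
LC_ALL=C tr -cd '\11\12\15\40-\176' < CLAUDE.md > AGENTS.md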

Problem 1a: Qwen3 models enter verbose thinking mode

Qwen3 models display <think> tags and spend time analyzing before taking action. This is model behavior, not an Open Code CLI issue.

Example:

opencode run "generate a todo.md file with the contents 'hello world, from qwen3' in it" --model ollama/qwen3:8b-16k
# Output shows:
# <think>
# Okay, the user wants me to generate a todo.md file...
# [lengthy analysis]
# </think>

Understanding:

  • This is inherent to Qwen3 model behavior with extended context
  • Build mode is already the default in Open Code CLI
  • There is no /no_think flag in Open Code CLI (only /thinking toggle which enables MORE thinking)
  • Task executes correctly but with verbose output (10-30 seconds for simple operations)

Best Approach: Accept the thinking mode as part of using Qwen3 models:

  • Tasks complete successfully despite verbosity
  • The thinking provides insight into model reasoning
  • Consider it "free documentation" of the decision-making process
  • Privacy benefits of local models outweigh the verbosity

Alternative: Use models with less thinking:

  • Mistral Nemo 12B: Minimal thinking but cannot create files (analysis only)
  • Granite 3.1 MoE: Fast but cannot create files (analysis only)
  • Qwen3:8b or Qwen3:4b: May have less thinking (needs testing, should support file creation)
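
Another option worth experimenting with: Qwen3 models recognize a /no_think soft switch inside the prompt text itself. This is model-level behavior, not an Open Code CLI flag, and results may vary by model variant and Ollama version:

# Append /no_think to the prompt; Qwen3 treats it as a soft switch to suppress <think> output
opencode run "create todo.md containing 'hello world' /no_think" --model ollama/qwen3:8b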

Problem 2: "You must read the file before overwriting it" error

Error message:

Edit AGENTS.md [replaceAll=true]
You must read the file /Users/jasperfrumau/code/imagewize.com/AGENTS.md
before overwriting it. Use the Read tool first

Cause: Open Code CLI's safety mechanism requires the agent to read a file before overwriting it, but the /init workflow may attempt to write without reading first.

Solutions:

  1. Manually edit AGENTS.md - Use a text editor to make changes instead of /init

  2. Use incremental updates - Instead of full replacement:

    # In Open Code CLI, request specific sections to be added
    "Add a section about X to AGENTS.md"
  3. Pre-existing AGENTS.md - If AGENTS.md exists and is working, don't run /init again:

    • The file is already configured for Open Code
    • Modifications can be made with targeted edit requests
    • Running /init again may cause conflicts

Best Practice:

For repositories with existing AGENTS.md files:

  • DO: Use the existing file, make targeted updates as needed
  • DO: Reference CLAUDE.md for comprehensive documentation
  • DON'T: Run /init repeatedly
  • DON'T: Attempt full file replacements unless necessary

Note: This repository already has a properly configured AGENTS.md file that works with Open Code CLI. The /init command is not needed for normal use.

Resources