Privacy-first, fully offline cross-lingual document QA for migrants and newcomers.
Upload a foreign-language legal document (e.g. a Chinese lease agreement) and ask questions in your own language (e.g. Polish). Get answers with source quotes and a hallucination trust score, entirely on your device; nothing is sent to the cloud.
Built for the Cohere AI Hackathon · March 10–24, 2026
| Component | Technology |
|---|---|
| Python | Python 3.11 (pinned for ML library stability) |
| Package Manager | UV (modern, fast dependency management) |
| LLM | Tiny Aya 3.35B GGUF via llama-server (local C++ inference) |
| Embeddings | BAAI/bge-m3 (cross-lingual sentence transformers) |
| Vector DB | ChromaDB (local persistence) |
| Hallucination Check | mDeBERTa-v3 (NLI Entailment) |
| UI | Gradio |
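The bge-m3 and ChromaDB rows above are what make cross-lingual QA work: question and document chunks land in one shared vector space, so a Polish question can retrieve a Chinese clause by cosine similarity. A toy sketch of that idea (the 4-dim vectors are invented purely for illustration; the real pipeline embeds text with sentence-transformers and stores vectors in ChromaDB):

```python
# Toy sketch of cross-lingual retrieval. Real embeddings come from
# BAAI/bge-m3; these tiny vectors are made up to show the mechanics.
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Pretend embeddings: a Polish question and two Chinese lease clauses.
query = [0.9, 0.1, 0.0, 0.4]           # "Ile wynosi kaucja?" (how much is the deposit?)
chunks = {
    "deposit clause": [0.8, 0.2, 0.1, 0.5],
    "pet policy":     [0.1, 0.9, 0.7, 0.0],
}

# Rank chunks by similarity and keep the best match (the pipeline keeps top 3).
best = max(chunks, key=lambda name: cosine(query, chunks[name]))
print(best)  # deposit clause scores higher, so it is retrieved
```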
What is a Makefile?
You'll see commands like make install throughout this guide. A Makefile is just a shortcuts file — like the buttons on a washing machine. You don't need to know the exact spin speed; you just press "Quick Wash." When you type make install, it runs about a dozen setup commands behind the scenes so you don't have to.
Why NOT ollama or llama-cpp-python?
Both Python wrappers currently crash on Tiny Aya's custom tokenizer (`unknown pre-tokenizer type: tiny_aya`). We compile and run llama.cpp directly as a C++ binary (`llama-server`). This is the only path that works reliably on all platforms.
You need a C++ build environment so we can compile the AI inference engine.
Mac:

```bash
brew install cmake
```

Linux:

```bash
sudo apt install cmake build-essential
```

Windows:

Install Visual Studio with the "Desktop development with C++" workload, and CMake.
HuggingFace (everyone):
You must have a HuggingFace account and agree to the model terms before downloading:
→ CohereLabs/tiny-aya-global
→ CohereLabs/tiny-aya-fire
Run these commands one by one in your terminal:
```bash
# 1. Clone the repo
git clone https://github.com/docunative-AI/docunative.git
cd docunative

# 2. Install uv — fast Python package manager (once per machine)
# macOS
brew install uv
# Linux / Windows (Git Bash or WSL)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 3. Create Python virtual environment and install all packages + models
make install
# This will:
# - Create a Python 3.11 virtual environment
# - Install all dependencies (~30 seconds)
# - Download Tiny Aya models if not present (~4.2 GB, 5-10 minutes)

# 4. Authenticate with HuggingFace (if models need downloading)
huggingface-cli login
```

Setup is done. You only ever need to run Step 2 once.
Note: The first make install takes 5-10 minutes (downloads models). Subsequent runs take ~30 seconds (models already present).
DocuNative is 100% offline. The AI model runs as a background server on your machine — so you need two terminal windows open at the same time.
⚠️ If you close Terminal 1, the app in Terminal 2 will crash.
🟢 Terminal 1 — Start the AI server (do this first)
This loads the 2GB model into RAM and keeps it running on port 8080.
```bash
# Mac / Linux
make server-global

# Windows
models\start_server.bat global
```

Wait until you see:

```
llama server listening at http://127.0.0.1:8080
```

Leave this terminal open.
🔵 Terminal 2 — Launch the UI
Open a brand new terminal window:
```bash
cd docunative
source .venv/bin/activate   # Windows: .venv\Scripts\activate
make demo
```

🎉 Open http://localhost:7860 in your browser. You're running DocuNative.
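Under the hood, the Gradio UI talks to llama-server over HTTP. A minimal client sketch, assuming the OpenAI-compatible chat endpoint that llama.cpp's server exposes (the prompt wording below is illustrative, not DocuNative's actual prompt):

```python
# Build a grounded-QA request for the local llama-server.
# The system/user prompt text here is an illustrative assumption.

def build_chat_request(question: str, context: str) -> tuple[str, dict]:
    """Return the URL and JSON payload for a document-grounded question."""
    payload = {
        "messages": [
            {"role": "system",
             "content": "Answer ONLY from the provided document excerpt."},
            {"role": "user",
             "content": f"Excerpt:\n{context}\n\nQuestion: {question}"},
        ],
        "temperature": 0.0,   # deterministic answers for document QA
    }
    return "http://127.0.0.1:8080/v1/chat/completions", payload

url, payload = build_chat_request("What is the deposit?", "Deposit: 5000 RMB.")
print(url)

# To actually send it (requires Terminal 1 to be running):
#   import json, urllib.request
#   req = urllib.request.Request(url, json.dumps(payload).encode(),
#                                {"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read())
```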
We have two model variants to test. To switch, stop Terminal 1 and restart it:
```bash
# The multilingual generalist (default) — GPU/Metal with prompt caching
make server-global

# The South Asian specialist (H1 — Hindi document QA)
make server-fire

# CPU users (Windows/Linux) — lower quantization for survivable latency
make server-global-q3    # ~30% faster than Q4 on CPU
make server-global-iq2   # ~60% faster, some quality loss
```

The UI model selector also reflects which model is currently loaded.
Mac users: if generation feels slow, run `make check-metal` to verify that Metal is active.
At any point, verify the server is alive:
```bash
curl http://localhost:8080/health
# Expected: {"status":"ok"}
```

If the health check fails:
- Check that Terminal 1 is still open and running
- Check that the model finished loading (look for the `listening` line)
- Try `make server-global` again from scratch
```
┌────────────────────────────────────────────────────────────────┐
│                     Gradio UI (port 7860)                      │
│        PDF upload · Language selector · NLI trust badge        │
└────────────────────────────────┬───────────────────────────────┘
                                 │
              ┌──────────────────▼──────────────────┐
              │              pipeline/              │
              │  extract.py   (PDF → text)          │
              │  embed.py     (text → vecs)         │ ← BAAI/bge-m3
              │  retrieve.py  (vecs → top3)         │ ← ChromaDB (local)
              │  generate.py  (top3 → answer)       │ ← llama-server :8080
              │  validate.py  (answer → struct)     │
              │  nli.py       (hallucination check) │ ← mDeBERTa-v3
              └──────────────────┬──────────────────┘
                                 │
              ┌──────────────────▼──────────────────┐
              │      llama-server (port 8080)       │
              │     Tiny Aya GGUF · C++ binary      │
              │         Metal / CUDA / CPU          │
              └─────────────────────────────────────┘
```
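The final `nli.py` stage can be sketched as a mapping from an NLI entailment probability (answer as hypothesis, retrieved chunks as premise, scored by mDeBERTa-v3) to the UI trust badge. The thresholds and badge labels below are illustrative assumptions, not the shipped values:

```python
# Sketch of the hallucination check's last step: turning an entailment
# probability into a trust badge. Thresholds/labels are assumptions.

def trust_badge(entailment_prob: float) -> str:
    """Map an NLI entailment probability to a UI trust badge."""
    if entailment_prob >= 0.8:
        return "grounded"                 # answer is entailed by the source quote
    if entailment_prob >= 0.5:
        return "partially supported"      # weak entailment; show with a warning
    return "possible hallucination"       # answer not supported by the document

print(trust_badge(0.93))  # grounded
print(trust_badge(0.31))  # possible hallucination
```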
The setup script compiles llama.cpp automatically with the best backend for your hardware:
- macOS ARM64 / x86_64 — Metal (Apple Silicon or Intel GPU)
- Linux with NVIDIA GPU — CUDA
- Linux / Windows (CPU only) — OpenBLAS
See ROADMAP.md for the full dependency graph.
We are building this pipeline to answer two specific questions for our Hackathon paper:
H1 — The Specialist Advantage
Does Tiny Aya Fire (South Asian specialist, trained with 5.8% South Asian data) outperform Tiny Aya Global (generalist) on Hindi legal document QA?
To test H1: stop Terminal 1, restart it with `make server-fire`, ask the same questions in Hindi, and compare results.
H2 — Internal Training Proportion Gradient
Does accuracy degrade as the language's internal training proportion in Tiny Aya decreases?
We test across three languages with a clean step gradient in Tiny Aya's All Regions training mix (Appendix A, Tiny Aya technical report):
| Language | Internal % | External NLP Resources |
|---|---|---|
| Chinese (Simplified) | 1.9% | High — vast web presence, strong NLP ecosystem |
| Hindi | 1.7% | Medium — growing rapidly, good tooling |
| Polish | 1.4% | Medium-low — smaller NLP research community |
All three languages are natively supported by Aya Expanse 32B (used for document generation), eliminating any document quality confound. The 0.5% gradient from Chinese to Polish is the widest achievable while maintaining clean document generation.
Why this framing matters: Tiny Aya deliberately balances training data across languages to reduce the curse of multilinguality. If we still observe zh > hi > pl performance despite near-equal training proportions, it suggests structural and linguistic factors matter independently of training data quantity. If we observe no gradient, it confirms Tiny Aya's balancing technique achieves its design goal for document QA — itself a novel finding.
To test H2 yourself: run `make server-global`, then run `python -m eval.evaluate --qa dataset/output/qa_pairs.jsonl --docs dataset/output --model Global --run-name eval1-h2`.
The inference server runs a compiled C++ binary (llama-server) on port 8080. The setup script automatically:
- Clones llama.cpp if not already present
- Compiles with the appropriate backend:
  - macOS ARM64: Metal acceleration (Apple Silicon)
  - macOS x86_64: Metal acceleration
  - Linux with CUDA: GPU acceleration
  - Linux without CUDA: CPU-only
  - Windows: CPU-only (requires Visual Studio with C++ tools)
- Starts the server on port 8080
Mac / Linux:

```bash
# Start with the global model
make server-global

# Or start with the fire model (H1 — South Asian specialist)
make server-fire
```

Windows:

```bat
:: Start with the global model
models\start_server.bat global

:: Or start with the fire model (H1 — South Asian specialist)
models\start_server.bat fire
```

Verify the server is running:

```bash
curl http://localhost:8080/health
# Should return: {"status":"ok"}
```
⚠️ Note: We do NOT use ollama or llama-cpp-python. The model runs via llama-server (compiled C++ binary) on port 8080.
The full evaluation pipeline tests DocuNative against 3,600 synthetic QA pairs across Chinese, Hindi, and Polish documents.
Evaluation sets:
- Eval 1 — Template QA: English questions from deterministic seed facts
- Eval 2 — LLM QA: Questions generated IN the document language by Aya Expanse 32B
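Eval 1's "deterministic seed facts" idea can be sketched as template filling: every seed fact has a fixed question template, so the gold answer is known exactly. The field names and wording below are assumptions for illustration; the actual templates live in `dataset/builder/qa_factory`:

```python
# Sketch of template QA generation from seed facts (field names assumed).

SEED_FACTS = [
    {"field": "monthly_rent", "value": "4,500 RMB"},
    {"field": "notice_period", "value": "30 days"},
]

TEMPLATES = {
    "monthly_rent": "What is the monthly rent?",
    "notice_period": "How long is the notice period?",
}

def make_qa_pairs(facts):
    """Turn seed facts into (question, gold answer) pairs deterministically."""
    return [(TEMPLATES[f["field"]], f["value"]) for f in facts]

for q, a in make_qa_pairs(SEED_FACTS):
    print(q, "->", a)
```

Because the gold answers come straight from the seed facts, Eval 1 needs no human labelling; Eval 2 relaxes this by letting Aya Expanse 32B write questions in the document language.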
Step 1 — Generate documents (first time only):
```bash
python -m dataset.builder.writer --language zh
python -m dataset.builder.writer --language hi
python -m dataset.builder.writer --language pl
```

Step 2 — Generate QA pairs:

```bash
# Eval 1 — template QA
python -m dataset.builder.qa_factory --full

# Eval 2 — LLM QA in the document language
python -m dataset.builder.qa_factory_llm
```

Step 3 — Pre-compute embeddings (run once, saves ~39 min per eval run):

```bash
python -m eval.precompute_embeddings --docs dataset/output/
```

Step 4 — Start the server (Terminal 1):

```bash
make server-global
```

Step 5 — Run Eval 1 — H2 (Terminal 2):

```bash
# Full run — 3,600 pairs (~4-5 hours on Metal)
# Writes eval_results_eval1-h2.jsonl and eval_report_eval1-h2.txt
python -m eval.evaluate \
    --qa dataset/output/qa_pairs.jsonl \
    --docs dataset/output \
    --model Global \
    --run-name eval1-h2
```

Step 6 — Run Eval 1 — H1 (Fire vs Global on Hindi):

```bash
make server-fire    # Terminal 1
python -m eval.evaluate \
    --qa dataset/output/qa_pairs.jsonl \
    --docs dataset/output \
    --model Fire --language hi \
    --run-name eval1-h1-fire

make server-global  # Terminal 1
python -m eval.evaluate \
    --qa dataset/output/qa_pairs.jsonl \
    --docs dataset/output \
    --model Global --language hi \
    --run-name eval1-h1-global
```

Step 7 — Run Eval 2 — LLM QA:

```bash
make server-global  # Terminal 1
python -m eval.evaluate \
    --qa dataset/output/qa_pairs_llm.jsonl \
    --docs dataset/output \
    --model Global \
    --run-name eval2-llm
```

Eval 1 / Eval 2 outputs go to eval/results/ as eval_results_<run-name>.jsonl and eval_report_<run-name>.txt (see --run-name above). Eval 3 writes eval_mkqa_results.jsonl and eval_mkqa_report.txt.
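A post-processing sketch for those result files, computing per-language accuracy from the JSONL records. The record fields (`"language"`, `"correct"`) are assumptions; check the actual schema in `eval/results/` before relying on them:

```python
# Sketch: aggregate per-language accuracy from eval_results_<run-name>.jsonl.
# Field names "language" and "correct" are assumed for illustration.
import json
from collections import defaultdict

def accuracy_by_language(lines):
    """Aggregate correct/total per language from JSONL record lines."""
    hits, totals = defaultdict(int), defaultdict(int)
    for line in lines:
        rec = json.loads(line)
        totals[rec["language"]] += 1
        hits[rec["language"]] += int(rec["correct"])
    return {lang: hits[lang] / totals[lang] for lang in totals}

# In-memory stand-in for one results file:
sample = [
    '{"language": "zh", "correct": true}',
    '{"language": "zh", "correct": false}',
    '{"language": "pl", "correct": true}',
]
print(accuracy_by_language(sample))
```

For H2, comparing these per-language numbers directly gives the zh/hi/pl gradient the paper is after.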
Note: `eval/results/` and `dataset/output/*.jsonl` are gitignored. Share output files manually.
DocuNative uses UV for dependency management. This provides faster installs, reproducible builds, and better dependency resolution.
📖 Complete UV Development Guide: docs/uv-development-guide.md
The guide covers:
- Setting up new branches with `uv sync`
- Installing packages (temporary vs. permanent)
- Adding dependencies to `pyproject.toml`
- Updating dependencies (`uv lock --upgrade`)
- Common commands reference
- Troubleshooting
- Best practices
- Complete example workflows
Quick Start for Contributors:
```bash
# 1. Create your branch
git checkout -b feature/your-feature

# 2. Install dependencies
uv sync

# 3. Start coding!
# On Linux/macOS
source .venv/bin/activate
# On Windows (Git Bash)
source .venv/Scripts/activate
```

⚠️ Adding new packages? Always use `uv add` — this updates both `pyproject.toml` and `uv.lock` atomically.

```bash
uv add <package>
git add pyproject.toml uv.lock
```
For detailed workflows, see the full guide linked above.
- Strictly offline. The UI and RAG pipeline are forbidden from calling any cloud API.
- No Python wrappers. We compile raw `llama.cpp` via CMake; Ollama and `llama-cpp-python` both fail on Tiny Aya's tokenizer.
- Two-terminal setup. Terminal 1 = server, Terminal 2 = UI. Always.
MIT License · Built during the Cohere AI Hackathon, March 2026