Vision AI - Analyse Images and Documents On-Device
Off Grid’s vision models can look at images and answer questions about them. Point your camera at a document, a product, a diagram, a receipt - and ask anything.
All inference runs on-device via llama.rn’s multimodal support. No image is uploaded anywhere.
What you can do
- Read receipts, invoices, business cards - extract text from photos
- Describe scenes - understand what’s in a photo
- Analyse documents - ask questions about a photo of a document
- Identify objects - “what is this?” with a photo
- Read handwriting - with capable models like Qwen3-VL
- Code from screenshots - show the model a UI and ask it to recreate the code
Available vision models
| Model | Params | Min RAM | Speed | Best for |
|---|---|---|---|---|
| SmolVLM2 500M | 0.5B | 3GB | Very fast (~7s) | Quick visual Q&A on low-RAM devices |
| SmolVLM 2B | 2B | 4GB | Fast (~8s) | General vision tasks |
| SmolVLM2 2.2B | 2.2B | 4GB | Fast (~8–10s) | Vision + video understanding |
| Gemma 4 E2B | 2B (MoE) | 4GB | Medium (~10–15s) | Best vision quality for 4GB, thinking mode |
| Gemma 4 E4B | 4B (MoE) | 6GB | Medium (~12–18s) | Strongest reasoning + vision, thinking mode |
| Qwen3-VL 2B | 2B | 4GB | Medium | Multilingual vision, thinking mode |
Gemma 4 models support both vision and thinking mode together - they can reason step-by-step about what they see, which dramatically improves accuracy on complex tasks.
How to use vision
- Open a chat in Off Grid
- Tap the attachment icon → choose Camera or Photo Library
- Select or capture your image
- Type your question and send
The model receives both the image and your question. Vision models automatically download a companion mmproj file (multimodal projector) during setup - this is included in the model size estimate.
Example prompts
Document analysis:
“What are the line items on this receipt? Give me a total.”
Technical reading:
“Explain this architecture diagram.”
Handwriting:
“Transcribe the text in this photo.”
Visual Q&A:
“What model of phone is shown in this photo?”
Code from UI:
“Write the React Native code to recreate this screen.”
Tips
Use Gemma 4 for complex reasoning - If you need the model to think carefully about what it sees (e.g. interpreting a chart, solving a problem from a photo), Gemma 4’s thinking mode produces much better results than a faster model.
Use SmolVLM for quick tasks - For simple description or text extraction, SmolVLM2 500M is surprisingly capable and much faster.
Image quality matters - Blurry or low-contrast photos degrade accuracy significantly. For documents, flat lighting and a straight-on angle work best.