Docker Profiles
Peregrine uses Docker Compose profiles to start only the services your hardware can support. Choose a profile with make start PROFILE=<name>.
Profile Reference
| Profile | Services started | Use case |
|---|---|---|
| remote | app, searxng | No GPU. LLM calls go to an external API (Anthropic, OpenAI-compatible). |
| cpu | app, ollama, searxng | No GPU. Runs local models on CPU, which is functional but slow. |
| single-gpu | app, ollama, vision, searxng | One NVIDIA GPU. Covers cover letters, research, and vision (survey screenshots). |
| dual-gpu | app, ollama, vllm, vision, searxng | Two NVIDIA GPUs. GPU 0 = Ollama (cover letters), GPU 1 = vLLM (research). |
Service Descriptions
| Service | Image / Source | Port | Purpose |
|---|---|---|---|
| app | Dockerfile (Streamlit) | 8501 | The main Peregrine UI |
| ollama | ollama/ollama | 11434 | Local model inference: cover letters and general tasks |
| vllm | vllm/vllm-openai | 8000 | High-throughput local inference: research tasks |
| vision | scripts/vision_service/ | 8002 | Moondream2: survey screenshot analysis |
| searxng | searxng/searxng | 8888 | Private meta-search engine: company research web scraping |
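Read together, the two tables tell you which ports a given profile needs free on the host. The following is a minimal sketch of that lookup, not part of Peregrine itself: the dictionary names and the `ports_for_profile` helper are hypothetical, and the ports are assumed to be the defaults listed in the table.

```python
# Profile -> services, copied from the Profile Reference table.
PROFILE_SERVICES = {
    "remote": ["app", "searxng"],
    "cpu": ["app", "ollama", "searxng"],
    "single-gpu": ["app", "ollama", "vision", "searxng"],
    "dual-gpu": ["app", "ollama", "vllm", "vision", "searxng"],
}

# Service -> port, copied from the Service Descriptions table
# (assumed here to be the default host ports).
SERVICE_PORTS = {
    "app": 8501,
    "ollama": 11434,
    "vllm": 8000,
    "vision": 8002,
    "searxng": 8888,
}


def ports_for_profile(profile: str) -> dict:
    """Return {service: port} for every service the profile starts."""
    try:
        services = PROFILE_SERVICES[profile]
    except KeyError:
        raise ValueError(f"unknown profile: {profile!r}") from None
    return {name: SERVICE_PORTS[name] for name in services}


if __name__ == "__main__":
    for profile in PROFILE_SERVICES:
        print(profile, ports_for_profile(profile))
```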
Choosing a Profile
remote
Use remote if:
- You have no NVIDIA GPU
- You plan to use Anthropic Claude or another API-hosted model exclusively
- You want the fastest startup (only two containers)
You must configure at least one external LLM backend in Settings → LLM Backends.
cpu
Use cpu if:
- You have no GPU but want to run models locally (e.g. for privacy)
- You only need it for light use; cover letter generation may take several minutes per request
Pull a model after the container starts:
single-gpu
Use single-gpu if:
- You have one NVIDIA GPU with at least 8 GB VRAM
This profile is recommended for most single-user installs. The vision service (Moondream2) starts on the same GPU using 4-bit quantisation (~1.5 GB VRAM).
dual-gpu
Use dual-gpu if:
- You have two or more NVIDIA GPUs
In this profile:
- GPU 0 handles Ollama (cover letters, quick tasks)
- GPU 1 handles vLLM (research, long-context tasks)
- The vision service shares GPU 0 with Ollama
GPU Memory Guidance
| GPU VRAM | Recommended profile | Notes |
|---|---|---|
| < 4 GB | cpu | GPU too small for practical model loading |
| 4–8 GB | single-gpu | Run smaller models (3B–8B parameters) |
| 8–16 GB | single-gpu | Run 8B–13B models comfortably |
| 16–24 GB | single-gpu | Run 13B–34B models |
| 24 GB+ | single-gpu or dual-gpu | 70B models with quantisation |
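The guidance table can be read as a simple threshold lookup. The sketch below encodes it that way; it is illustrative only and is not how preflight.py is known to work. The `Guidance` type and `guidance_for_vram` function are hypothetical names. Because the table's ranges share their endpoints, the sketch assumes each boundary value belongs to the higher row (so exactly 8 GB falls in the 8–16 GB row).

```python
from typing import NamedTuple, Tuple


class Guidance(NamedTuple):
    profiles: Tuple[str, ...]
    note: str


# (exclusive upper bound in GB, guidance), in ascending order.
# Rows mirror the GPU Memory Guidance table.
_ROWS = [
    (4.0, Guidance(("cpu",), "GPU too small for practical model loading")),
    (8.0, Guidance(("single-gpu",), "Run smaller models (3B-8B parameters)")),
    (16.0, Guidance(("single-gpu",), "Run 8B-13B models comfortably")),
    (24.0, Guidance(("single-gpu",), "Run 13B-34B models")),
]
_TOP_ROW = Guidance(("single-gpu", "dual-gpu"), "70B models with quantisation")


def guidance_for_vram(vram_gb: float) -> Guidance:
    """Return the table row matching a single GPU's VRAM in GB."""
    if vram_gb < 0:
        raise ValueError("VRAM cannot be negative")
    for upper_bound, guidance in _ROWS:
        if vram_gb < upper_bound:
            return guidance
    return _TOP_ROW  # 24 GB and above


if __name__ == "__main__":
    for size in (2, 6, 12, 24):
        row = guidance_for_vram(size)
        print(f"{size} GB -> {' or '.join(row.profiles)}: {row.note}")
```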
How preflight.py Works
make start calls scripts/preflight.py before launching Docker. Preflight does the following:
- Port conflict detection: checks whether STREAMLIT_PORT, OLLAMA_PORT, VLLM_PORT, SEARXNG_PORT, and VISION_PORT are already in use. Reports any conflicts and suggests alternatives.
- GPU enumeration: queries nvidia-smi for GPU count and VRAM per card.
- RAM check: reads /proc/meminfo (Linux) or vm_stat (macOS) to determine available system RAM.
- KV cache offload: if GPU VRAM is less than 10 GB, preflight calculates CPU_OFFLOAD_GB (the amount of KV cache to spill to system RAM) and writes it to .env. The vLLM container picks this up via --cpu-offload-gb.
- Profile recommendation: writes RECOMMENDED_PROFILE to .env. This is informational; make start uses the PROFILE variable you specify (defaulting to remote).
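To make the first three checks concrete, here is a minimal sketch of how they could be implemented with the Python standard library. It is not the actual preflight.py: every function name is hypothetical, the port check only probes 127.0.0.1, and only the Linux /proc/meminfo path is shown. The `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` invocation is a standard nvidia-smi query that reports total memory in MiB.

```python
import socket
import subprocess
from typing import List, Optional, Tuple


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds.
        return sock.connect_ex((host, port)) == 0


def parse_gpu_csv(text: str) -> List[Tuple[int, int]]:
    """Parse 'index, memory.total' CSV rows into (index, MiB) pairs."""
    gpus = []
    for line in text.splitlines():
        if not line.strip():
            continue
        index, mem_mib = (field.strip() for field in line.split(","))
        gpus.append((int(index), int(mem_mib)))
    return gpus


def enumerate_gpus() -> List[Tuple[int, int]]:
    """Query nvidia-smi for GPU index and total VRAM; [] if unavailable."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=index,memory.total",
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError):
        # nvidia-smi missing, failed, or timed out: treat as "no GPU".
        return []
    return parse_gpu_csv(result.stdout)


def mem_available_kib(meminfo_text: str) -> Optional[int]:
    """Extract MemAvailable (in KiB) from the contents of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    return None


if __name__ == "__main__":
    print("port 8501 in use:", port_in_use(8501))
    print("GPUs (index, MiB):", enumerate_gpus())
```

A caller could loop over the configured ports, collect those for which `port_in_use` returns True, and report them before Docker starts.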
You can run preflight independently:
Customising Ports
Edit .env before running make start:
All containers read from .env via the env_file directive in compose.yml.
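To check which ports a given .env would produce, a small parser is enough. The sketch below is an illustration under stated assumptions, not Peregrine code: `parse_env` and `effective_ports` are hypothetical helpers, the parser handles only simple `KEY=value` lines, and pairing each variable with the port from the Service Descriptions table (for example STREAMLIT_PORT with 8501) is an assumption about the defaults.

```python
from typing import Dict

# Assumed defaults: variable names come from the preflight section,
# values from the Service Descriptions table.
DEFAULT_PORTS = {
    "STREAMLIT_PORT": 8501,
    "OLLAMA_PORT": 11434,
    "VLLM_PORT": 8000,
    "VISION_PORT": 8002,
    "SEARXNG_PORT": 8888,
}


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")
    return values


def effective_ports(env_text: str) -> Dict[str, int]:
    """Overlay port settings found in .env on the assumed defaults."""
    env = parse_env(env_text)
    ports = dict(DEFAULT_PORTS)
    for name in DEFAULT_PORTS:
        if name in env:
            ports[name] = int(env[name])
    return ports


if __name__ == "__main__":
    sample = "# move the UI off 8501\nSTREAMLIT_PORT=8502\n"
    print(effective_ports(sample))
```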