Docker Profiles¶

Peregrine uses Docker Compose profiles to start only the services your hardware can support. Choose a profile with make start PROFILE=<name>.

Profile Reference¶

Profile	Services started	Use case
`remote`	`app`, `searxng`	No GPU. LLM calls go to an external API (Anthropic, OpenAI-compatible).
`cpu`	`app`, `ollama`, `searxng`	No GPU. Runs local models on CPU — functional but slow.
`single-gpu`	`app`, `ollama`, `vision`, `searxng`	One NVIDIA GPU. Covers cover letters, research, and vision (survey screenshots).
`dual-gpu`	`app`, `ollama`, `vllm`, `vision`, `searxng`	Two NVIDIA GPUs. GPU 0 = Ollama (cover letters), GPU 1 = vLLM (research).

Service Descriptions¶

Service	Image / Source	Port	Purpose
`app`	`Dockerfile` (Streamlit)	8501	The main Peregrine UI
`ollama`	`ollama/ollama`	11434	Local model inference — cover letters and general tasks
`vllm`	`vllm/vllm-openai`	8000	High-throughput local inference — research tasks
`vision`	`scripts/vision_service/`	8002	Moondream2 — survey screenshot analysis
`searxng`	`searxng/searxng`	8888	Private meta-search engine — company research web scraping

Choosing a Profile¶

remote¶

Use remote if: - You have no NVIDIA GPU - You plan to use Anthropic Claude or another API-hosted model exclusively - You want the fastest startup (only two containers)

You must configure at least one external LLM backend in Settings → LLM Backends.

cpu¶

Use cpu if: - You have no GPU but want to run models locally (e.g. for privacy) - Acceptable for light use; cover letter generation may take several minutes per request

Pull a model after the container starts:

docker exec -it peregrine-ollama-1 ollama pull llama3.1:8b

single-gpu¶

Use single-gpu if: - You have one NVIDIA GPU with at least 8 GB VRAM - Recommended for most single-user installs - The vision service (Moondream2) starts on the same GPU using 4-bit quantisation (~1.5 GB VRAM)

dual-gpu¶

Use dual-gpu if: - You have two or more NVIDIA GPUs - GPU 0 handles Ollama (cover letters, quick tasks) - GPU 1 handles vLLM (research, long-context tasks) - The vision service shares GPU 0 with Ollama

GPU Memory Guidance¶

GPU VRAM	Recommended profile	Notes
< 4 GB	`cpu`	GPU too small for practical model loading
4–8 GB	`single-gpu`	Run smaller models (3B–8B parameters)
8–16 GB	`single-gpu`	Run 8B–13B models comfortably
16–24 GB	`single-gpu`	Run 13B–34B models
24 GB+	`single-gpu` or `dual-gpu`	70B models with quantisation

How preflight.py Works¶

make start calls scripts/preflight.py before launching Docker. Preflight does the following:

Port conflict detection — checks whether STREAMLIT_PORT, OLLAMA_PORT, VLLM_PORT, SEARXNG_PORT, and VISION_PORT are already in use. Reports any conflicts and suggests alternatives.
GPU enumeration — queries nvidia-smi for GPU count and VRAM per card.
RAM check — reads /proc/meminfo (Linux) or vm_stat (macOS) to determine available system RAM.
KV cache offload — if GPU VRAM is less than 10 GB, preflight calculates CPU_OFFLOAD_GB (the amount of KV cache to spill to system RAM) and writes it to .env. The vLLM container picks this up via --cpu-offload-gb.
Profile recommendation — writes RECOMMENDED_PROFILE to .env. This is informational; make start uses the PROFILE variable you specify (defaulting to remote).

You can run preflight independently:

make preflight
# or
python scripts/preflight.py

Customising Ports¶

Edit .env before running make start:

STREAMLIT_PORT=8501
OLLAMA_PORT=11434
VLLM_PORT=8000
SEARXNG_PORT=8888
VISION_PORT=8002

All containers read from .env via the env_file directive in compose.yml.