Docker Profiles

Peregrine uses Docker Compose profiles to start only the services your hardware can support. Choose a profile with make start PROFILE=<name>.

Profile Reference

Profile	Services started	Use case
`remote`	`app`, `searxng`	No GPU. LLM calls go to an external API (Anthropic, OpenAI-compatible).
`cpu`	`app`, `ollama`, `searxng`	No GPU. Runs local models on CPU. On Apple Silicon, pairs with native Ollama for Metal GPU inference — see below.
`single-gpu`	`app`, `ollama`, `vision`, `searxng`	One NVIDIA GPU. Handles cover letters, research, and vision (survey screenshots).
`dual-gpu`	`app`, `ollama`, `vllm`, `vision`, `searxng`	Two NVIDIA GPUs. GPU 0 = Ollama (cover letters), GPU 1 = vLLM (research).

Service Descriptions

Service	Image / Source	Port	Purpose
`app`	`Dockerfile` (Streamlit)	8501	The main Peregrine UI
`ollama`	`ollama/ollama`	11434	Local model inference — cover letters and general tasks
`vllm`	`vllm/vllm-openai`	8000	High-throughput local inference — research tasks
`vision`	`scripts/vision_service/`	8002	Moondream2 — survey screenshot analysis
`searxng`	`searxng/searxng`	8888	Private meta-search engine — company research web scraping
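
The two tables above can be combined to see which host ports a given profile will occupy. The following is a minimal sketch that simply transcribes the tables into dictionaries; the names PROFILE_SERVICES, SERVICE_PORTS, and ports_for_profile are hypothetical and are not part of Peregrine's code.

```python
# Profile-to-service mapping, transcribed from the Profile Reference table.
PROFILE_SERVICES = {
    "remote": ["app", "searxng"],
    "cpu": ["app", "ollama", "searxng"],
    "single-gpu": ["app", "ollama", "vision", "searxng"],
    "dual-gpu": ["app", "ollama", "vllm", "vision", "searxng"],
}

# Default ports, transcribed from the Service Descriptions table.
SERVICE_PORTS = {
    "app": 8501,
    "ollama": 11434,
    "vllm": 8000,
    "vision": 8002,
    "searxng": 8888,
}


def ports_for_profile(profile: str) -> dict[str, int]:
    """Return {service: default port} for every service a profile starts."""
    if profile not in PROFILE_SERVICES:
        raise ValueError(f"unknown profile: {profile!r}")
    return {name: SERVICE_PORTS[name] for name in PROFILE_SERVICES[profile]}
```

For example, ports_for_profile("remote") lists only the app and searxng ports, which matches the two-container startup described for that profile.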

Choosing a Profile

remote

Use remote if:

- You have no NVIDIA GPU
- You plan to use Anthropic Claude or another API-hosted model exclusively
- You want the fastest startup (only two containers)

You must configure at least one external LLM backend in Settings → LLM Backends.

cpu

Use cpu if:

- You have no GPU but want to run models locally (e.g. for privacy)
- You are on macOS / Apple Silicon and want Metal GPU acceleration (see Apple Silicon GPU below)

CPU-only inference is acceptable for light use, but cover letter generation may take several minutes per request.

Pull a model after the container starts:

docker exec -it peregrine-ollama-1 ollama pull llama3.1:8b

Or, if Ollama is running natively (adopted by preflight):

ollama pull llama3.1:8b

Apple Silicon GPU

Docker Desktop runs in a Linux VM on macOS and cannot access the Apple GPU. The GPU profiles (single-gpu, dual-gpu) require NVIDIA hardware and are not available on Mac.

Metal-accelerated inference is available via native Ollama. When Ollama is running natively on port 11434, preflight.py detects it, stubs out the Docker Ollama container so there's no conflict, and routes inference through the native process — which uses Metal automatically.

# Install and start native Ollama (setup.sh offers to do this automatically)
brew install ollama
brew services start ollama

# Start Peregrine — preflight adopts native Ollama
./manage.sh start --profile cpu

The cpu profile is the correct choice on macOS even when using the Apple GPU, because it starts the right set of Docker services without requiring NVIDIA GPU reservations. Inference performance will reflect Metal acceleration, not CPU speed.

preflight.py detects the Apple Silicon GPU via system_profiler and reports it in the preflight output:

║    GPU      Apple M3 Pro  (Apple Silicon, unified memory)
║             18.4 / 36.0 GB RAM available to GPU
║    ⚡  Apple Silicon GPU detected.
║       Docker cannot access Metal — install Ollama natively for GPU inference:
║         brew install ollama && brew services start ollama

Once native Ollama is adopted:

║    ⚡  Native Ollama detected — Metal GPU acceleration active
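
The adoption check described in this section can be approximated by probing the Ollama HTTP API on the expected port. This is only a sketch under assumptions: the function name native_ollama_running is hypothetical, and the source does not say how preflight.py actually performs its detection. The sketch relies on Ollama's GET /api/version endpoint, which returns a JSON object with a "version" field.

```python
import http.client
import json
import urllib.request

# Opener that ignores proxy environment variables, since the probe targets
# the local machine only.
_LOCAL_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def native_ollama_running(host: str = "127.0.0.1", port: int = 11434,
                          timeout: float = 1.0) -> bool:
    """Return True if something answering like Ollama listens on host:port."""
    url = f"http://{host}:{port}/api/version"
    try:
        with _LOCAL_OPENER.open(url, timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, http.client.HTTPException, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON reply.
        return False
    return isinstance(payload, dict) and "version" in payload
```

Requiring the "version" key, rather than accepting any open port, reduces the chance of mistaking an unrelated service on 11434 for Ollama.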

single-gpu

Use single-gpu if you have one NVIDIA GPU with at least 8 GB VRAM. This profile is recommended for most single-user installs. The vision service (Moondream2) starts on the same GPU using 4-bit quantisation (~1.5 GB VRAM).

dual-gpu

Use dual-gpu if you have two or more NVIDIA GPUs. The services are split across the cards as follows:

- GPU 0 handles Ollama (cover letters, quick tasks)
- GPU 1 handles vLLM (research, long-context tasks)
- The vision service shares GPU 0 with Ollama

GPU Memory Guidance

GPU VRAM	Recommended profile	Notes
< 4 GB	`cpu`	GPU too small for practical model loading
4–8 GB	`single-gpu`	Run smaller models (3B–8B parameters)
8–16 GB	`single-gpu`	Run 8B–13B models comfortably
16–24 GB	`single-gpu`	Run 13B–34B models
24 GB+	`single-gpu` or `dual-gpu`	70B models with quantisation
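
To apply the table, you first need the VRAM figure for each card. The sketch below reads it from nvidia-smi and maps it to a profile. It is an illustration, not preflight.py's actual logic: the function names are hypothetical, and the rule "two or more GPUs with at least 4 GB each means dual-gpu" is an assumption layered on top of the table's 4 GB threshold.

```python
import subprocess


def parse_vram_gb(csv_text: str) -> list[float]:
    """Parse one memory.total value (MiB) per line into GiB figures."""
    values = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            values.append(float(line) / 1024)
        except ValueError:
            continue  # skip fields such as "[N/A]"
    return values


def query_vram_gb() -> list[float]:
    """Return VRAM per NVIDIA GPU, or [] if nvidia-smi is missing or fails."""
    cmd = ["nvidia-smi", "--query-gpu=memory.total",
           "--format=csv,noheader,nounits"]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True,
                             check=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return []
    return parse_vram_gb(out.stdout)


def recommend_profile(vram_gb: list[float]) -> str:
    """Map per-GPU VRAM to a profile name (assumed rule, see lead-in)."""
    usable = [v for v in vram_gb if v >= 4.0]  # below 4 GB: too small
    if not usable:
        return "cpu"
    return "dual-gpu" if len(usable) >= 2 else "single-gpu"
```

Calling recommend_profile(query_vram_gb()) on a machine without NVIDIA hardware returns "cpu"; the remote profile remains an equally valid choice there if you prefer API-hosted models.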

How preflight.py Works

make start calls scripts/preflight.py before launching Docker. Preflight does the following:

- Port conflict detection: checks whether STREAMLIT_PORT, OLLAMA_PORT, VLLM_PORT, SEARXNG_PORT, and VISION_PORT are already in use. Reports any conflicts and suggests alternatives.
- GPU enumeration: queries nvidia-smi for the NVIDIA GPU count and VRAM per card (Linux). On macOS, falls back to system_profiler SPDisplaysDataType to detect an Apple Silicon GPU; unified RAM is reported as the GPU memory figure.
- RAM check: reads /proc/meminfo (Linux) or vm_stat (macOS) to determine available system RAM.
- KV cache offload: if GPU VRAM is less than 10 GB, preflight calculates CPU_OFFLOAD_GB and writes it to .env. The vLLM container picks this up via --cpu-offload-gb, which vLLM documents as the amount of memory per GPU, in GiB, to offload to CPU RAM. The flag moves part of the model weights to system RAM, which leaves more VRAM for the KV cache.
- Profile recommendation: writes RECOMMENDED_PROFILE to .env. This is informational; make start uses the PROFILE variable you specify (defaulting to remote).
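
The port conflict step in the list above can be sketched as a bind test. This is a minimal illustration assuming a bind-based check; the source does not show how preflight.py implements it, and the names port_is_free and suggest_alternative are hypothetical.

```python
import socket
from typing import Optional


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False  # address in use, or not permitted
    return True


def suggest_alternative(port: int, host: str = "127.0.0.1",
                        span: int = 100) -> Optional[int]:
    """Return the first free port above `port` within `span`, else None."""
    for candidate in range(port + 1, min(port + span, 65535) + 1):
        if port_is_free(candidate, host):
            return candidate
    return None
```

A bind test catches any listener on the port, not only HTTP services. The result is a snapshot, so another process can still claim the port between the check and the container start.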

You can run preflight independently:

make preflight
# or
python scripts/preflight.py

Customising Ports

Edit .env before running make start:

STREAMLIT_PORT=8501
OLLAMA_PORT=11434
VLLM_PORT=8000
SEARXNG_PORT=8888
VISION_PORT=8002

All containers read from .env via the env_file directive in compose.yml.
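
To sanity-check an edited .env before starting, the port variables can be read back with a few lines of Python. This is a hedged sketch: read_ports and DEFAULT_PORTS are hypothetical names, the defaults are copied from the listing above, and the parser handles only simple KEY=value lines, not the full syntax Docker Compose accepts.

```python
# Defaults copied from the .env listing in this section.
DEFAULT_PORTS = {
    "STREAMLIT_PORT": 8501,
    "OLLAMA_PORT": 11434,
    "VLLM_PORT": 8000,
    "SEARXNG_PORT": 8888,
    "VISION_PORT": 8002,
}


def read_ports(env_text: str) -> dict[str, int]:
    """Return the five port settings, using defaults for missing keys."""
    ports = dict(DEFAULT_PORTS)
    for raw in env_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key in ports:
            # Drop a trailing inline comment and surrounding quotes.
            value = value.split("#", 1)[0].strip().strip("'\"")
            ports[key] = int(value)
    return ports


def duplicate_ports(ports: dict[str, int]) -> set[int]:
    """Return port numbers assigned to more than one variable."""
    seen, dupes = set(), set()
    for number in ports.values():
        (dupes if number in seen else seen).add(number)
    return dupes
```

Running duplicate_ports(read_ports(text)) on the file contents gives an empty set when every service has its own port.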