Custom (OpenAI-compatible)

Use this for any endpoint that speaks OpenAI's chat-completions API. Common cases:

vLLM — open-source serving for Llama, Qwen, Mistral, etc.
Ollama — local inference on your laptop / on-prem GPU
LM Studio — desktop GUI inference
Self-hosted GPU box — your own model behind a vLLM / TGI server
LiteLLM — proxy that exposes any provider as OpenAI-compatible

Configuration

~/.insightworker/.env:

LLM_PROVIDER=custom
CUSTOM_LLM_BASE_URL=http://gpu-box.internal:8000/v1
CUSTOM_LLM_API_KEY=optional-if-required
CUSTOM_LLM_MODEL=meta-llama/Llama-3.1-70B-Instruct

The model name is whatever your endpoint expects — typically the HuggingFace model id for vLLM, or the local model tag for Ollama (e.g. llama3.1:70b-instruct).

Common endpoint URLs

Service	Base URL pattern
vLLM	`http://<host>:8000/v1`
Ollama	`http://<host>:11434/v1`
LM Studio	`http://localhost:1234/v1`
TGI (HuggingFace)	`http://<host>:80/v1`
LiteLLM proxy	`http://<host>:4000/v1`

The trailing /v1 matters — that's the OpenAI-compatible path prefix.

Requirements for the endpoint

To work with InsightWorker's agent loop, the endpoint must:

Implement OpenAI's chat-completions schema — /v1/chat/completions with messages, tools, tool_choice parameters
Support tool calling — the model must be instruct-tuned for tool use. Most modern instruct models do (Llama 3.1+, Qwen 2.5+, Mistral Large, DeepSeek-V3, etc.). Some chat-only models don't.
Return tool_calls in the standard shape — { id, type: "function", function: { name, arguments: <json string> } }

If your model "chats fine but never calls tools", the model's instruction-tuning isn't strong enough for tool calling. Try a different model — Llama 3.1 70B Instruct is a known-good baseline.

Security warning

Prompts and tool outputs are sent to whatever URL you configure. Only point at endpoints you trust. The base URL is never auto-set; you opt in by setting the env var.

If you're running on-prem and your CUSTOM_LLM_BASE_URL points at an internal service, no concern. If you're pointing at a third-party hosted endpoint, treat it the same way you'd treat sending data to OpenAI direct.

What InsightWorker doesn't try to do for custom endpoints

max_completion_tokens vs max_tokens — we send max_tokens. If your endpoint requires the new param, you'll need a proxy or wait for us to add a config knob.
reasoning_effort — we don't forward this for the custom flavour, since most OpenAI-compatible endpoints don't understand it.
Responses API — we don't route to /v1/responses for the custom flavour. If your endpoint requires that path for certain models, route through LiteLLM or a similar proxy.

Performance tuning

For local/on-prem hosting:

Quantization: Q4 / Q8 / FP16 — pick based on your VRAM. A 70B model in Q4 fits in ~40GB of VRAM.
Batch size: vLLM auto-batches; tune --max-num-seqs if you have many concurrent agent runs.
Context window: agent loop sets maxTokens: 16384. Make sure your serving config allows that output budget.

Common gotchas

Symptom	Cause	Fix
Connection refused	Endpoint not running or wrong port	`curl <base_url>/models` to confirm
Tool calls never happen	Model not tool-tuned	Switch to Llama 3.1 Instruct or similar
Tool args malformed JSON	Smaller models drift	InsightWorker reports a clean error; switch models or constrain output
401 Unauthorized	Endpoint requires auth, key not set	Set `CUSTOM_LLM_API_KEY`
Slow inference	Quantization too aggressive or batch size off	Profile your serving stack

Pairing with other tools

You can mix and match: use a self-hosted Llama as your LLM (LLM_PROVIDER=custom) while still using Bedrock-hosted Claude for specific high-stakes calls via a custom skill, or use OpenAI's gpt-5-pro only for reasoning tasks. Per-skill provider routing isn't built in yet but is on the roadmap.