Custom (OpenAI-compatible)
Use this for any endpoint that speaks OpenAI's chat-completions API. Common cases:
- vLLM — open-source serving for Llama, Qwen, Mistral, etc.
- Ollama — local inference on your laptop / on-prem GPU
- LM Studio — desktop GUI inference
- Self-hosted GPU box — your own model behind a vLLM / TGI server
- LiteLLM — proxy that exposes any provider as OpenAI-compatible
Configuration
~/.insightworker/.env:
LLM_PROVIDER=custom
CUSTOM_LLM_BASE_URL=http://gpu-box.internal:8000/v1
CUSTOM_LLM_API_KEY=optional-if-required
CUSTOM_LLM_MODEL=meta-llama/Llama-3.1-70B-Instruct
The model name is whatever your endpoint expects — typically the HuggingFace model id for vLLM, or the local model tag for Ollama (e.g. llama3.1:70b-instruct).
Common endpoint URLs
| Service | Base URL pattern |
|---|---|
| vLLM | http://<host>:8000/v1 |
| Ollama | http://<host>:11434/v1 |
| LM Studio | http://localhost:1234/v1 |
| TGI (HuggingFace) | http://<host>:80/v1 |
| LiteLLM proxy | http://<host>:4000/v1 |
The trailing /v1 matters — that's the OpenAI-compatible path prefix.
Requirements for the endpoint
To work with InsightWorker's agent loop, the endpoint must:
- Implement OpenAI's chat-completions schema —
/v1/chat/completionswithmessages,tools,tool_choiceparameters - Support tool calling — the model must be instruct-tuned for tool use. Most modern instruct models do (Llama 3.1+, Qwen 2.5+, Mistral Large, DeepSeek-V3, etc.). Some chat-only models don't.
- Return tool_calls in the standard shape —
{ id, type: "function", function: { name, arguments: <json string> } }
If your model "chats fine but never calls tools", the model's instruction-tuning isn't strong enough for tool calling. Try a different model — Llama 3.1 70B Instruct is a known-good baseline.
Security warning
Prompts and tool outputs are sent to whatever URL you configure. Only point at endpoints you trust. The base URL is never auto-set; you opt in by setting the env var.
If you're running on-prem and your CUSTOM_LLM_BASE_URL points at an internal service, no concern. If you're pointing at a third-party hosted endpoint, treat it the same way you'd treat sending data to OpenAI direct.
What InsightWorker doesn't try to do for custom endpoints
max_completion_tokensvsmax_tokens— we sendmax_tokens. If your endpoint requires the new param, you'll need a proxy or wait for us to add a config knob.reasoning_effort— we don't forward this for the custom flavour, since most OpenAI-compatible endpoints don't understand it.- Responses API — we don't route to
/v1/responsesfor the custom flavour. If your endpoint requires that path for certain models, route through LiteLLM or a similar proxy.
Performance tuning
For local/on-prem hosting:
- Quantization: Q4 / Q8 / FP16 — pick based on your VRAM. A 70B model in Q4 fits in ~40GB of VRAM.
- Batch size: vLLM auto-batches; tune
--max-num-seqsif you have many concurrent agent runs. - Context window: agent loop sets
maxTokens: 16384. Make sure your serving config allows that output budget.
Common gotchas
| Symptom | Cause | Fix |
|---|---|---|
| Connection refused | Endpoint not running or wrong port | curl <base_url>/models to confirm |
| Tool calls never happen | Model not tool-tuned | Switch to Llama 3.1 Instruct or similar |
| Tool args malformed JSON | Smaller models drift | InsightWorker reports a clean error; switch models or constrain output |
| 401 Unauthorized | Endpoint requires auth, key not set | Set CUSTOM_LLM_API_KEY |
| Slow inference | Quantization too aggressive or batch size off | Profile your serving stack |
Pairing with other tools
You can mix and match: use a self-hosted Llama as your LLM (LLM_PROVIDER=custom) while still using Bedrock-hosted Claude for specific high-stakes calls via a custom skill, or use OpenAI's gpt-5-pro only for reasoning tasks. Per-skill provider routing isn't built in yet but is on the roadmap.
See also
- openai.md — OpenAI direct (the "chat-completions API" reference)
- overview.md — full provider matrix
Source: docs/providers/custom-openai-compatible.md in the public repo. Open a PR with corrections.
