InsightWorker Logo
  • contact@verticalserve.com
Docs / LLM providers / Custom (OpenAI-compatible)

Custom (OpenAI-compatible)

Use this for any endpoint that speaks OpenAI's chat-completions API. Common cases:

  • vLLM — open-source serving for Llama, Qwen, Mistral, etc.
  • Ollama — local inference on your laptop / on-prem GPU
  • LM Studio — desktop GUI inference
  • Self-hosted GPU box — your own model behind a vLLM / TGI server
  • LiteLLM — proxy that exposes any provider as OpenAI-compatible

Configuration

~/.insightworker/.env:

LLM_PROVIDER=custom
CUSTOM_LLM_BASE_URL=http://gpu-box.internal:8000/v1
CUSTOM_LLM_API_KEY=optional-if-required
CUSTOM_LLM_MODEL=meta-llama/Llama-3.1-70B-Instruct

The model name is whatever your endpoint expects — typically the HuggingFace model id for vLLM, or the local model tag for Ollama (e.g. llama3.1:70b-instruct).

Common endpoint URLs

ServiceBase URL pattern
vLLMhttp://<host>:8000/v1
Ollamahttp://<host>:11434/v1
LM Studiohttp://localhost:1234/v1
TGI (HuggingFace)http://<host>:80/v1
LiteLLM proxyhttp://<host>:4000/v1

The trailing /v1 matters — that's the OpenAI-compatible path prefix.

Requirements for the endpoint

To work with InsightWorker's agent loop, the endpoint must:

  1. Implement OpenAI's chat-completions schema/v1/chat/completions with messages, tools, tool_choice parameters
  2. Support tool calling — the model must be instruct-tuned for tool use. Most modern instruct models do (Llama 3.1+, Qwen 2.5+, Mistral Large, DeepSeek-V3, etc.). Some chat-only models don't.
  3. Return tool_calls in the standard shape{ id, type: "function", function: { name, arguments: <json string> } }

If your model "chats fine but never calls tools", the model's instruction-tuning isn't strong enough for tool calling. Try a different model — Llama 3.1 70B Instruct is a known-good baseline.

Security warning

Prompts and tool outputs are sent to whatever URL you configure. Only point at endpoints you trust. The base URL is never auto-set; you opt in by setting the env var.

If you're running on-prem and your CUSTOM_LLM_BASE_URL points at an internal service, no concern. If you're pointing at a third-party hosted endpoint, treat it the same way you'd treat sending data to OpenAI direct.

What InsightWorker doesn't try to do for custom endpoints

  • max_completion_tokens vs max_tokens — we send max_tokens. If your endpoint requires the new param, you'll need a proxy or wait for us to add a config knob.
  • reasoning_effort — we don't forward this for the custom flavour, since most OpenAI-compatible endpoints don't understand it.
  • Responses API — we don't route to /v1/responses for the custom flavour. If your endpoint requires that path for certain models, route through LiteLLM or a similar proxy.

Performance tuning

For local/on-prem hosting:

  • Quantization: Q4 / Q8 / FP16 — pick based on your VRAM. A 70B model in Q4 fits in ~40GB of VRAM.
  • Batch size: vLLM auto-batches; tune --max-num-seqs if you have many concurrent agent runs.
  • Context window: agent loop sets maxTokens: 16384. Make sure your serving config allows that output budget.

Common gotchas

SymptomCauseFix
Connection refusedEndpoint not running or wrong portcurl <base_url>/models to confirm
Tool calls never happenModel not tool-tunedSwitch to Llama 3.1 Instruct or similar
Tool args malformed JSONSmaller models driftInsightWorker reports a clean error; switch models or constrain output
401 UnauthorizedEndpoint requires auth, key not setSet CUSTOM_LLM_API_KEY
Slow inferenceQuantization too aggressive or batch size offProfile your serving stack

Pairing with other tools

You can mix and match: use a self-hosted Llama as your LLM (LLM_PROVIDER=custom) while still using Bedrock-hosted Claude for specific high-stakes calls via a custom skill, or use OpenAI's gpt-5-pro only for reasoning tasks. Per-skill provider routing isn't built in yet but is on the roadmap.

See also

  • openai.md — OpenAI direct (the "chat-completions API" reference)
  • overview.md — full provider matrix

Source: docs/providers/custom-openai-compatible.md in the public repo. Open a PR with corrections.