Skip to main content

OpenAI-compatible vs custom

EcoLink supports two container shapes for user inference instances:

  1. OpenAI-compatible (default, recommended) — your container exposes standard /v1/... endpoints
  2. Custom API path — your container exposes a different HTTP surface; EcoLink proxies through

Pick based on what your inference server speaks.

OpenAI-compatible mode

Most modern inference servers speak the OpenAI schema:

  • vLLM: vllm/vllm-openai:latest — OpenAI-compatible out of the box
  • Text Generation Inference (TGI): ghcr.io/huggingface/text-generation-inference:3.0.0 — has /v1/chat/completions and /v1/completions
  • SGLang: lmsysorg/sglang:latest — has OpenAI-compat mode
  • LMDeploy, Ollama, llama.cpp (--api): all support OpenAI-compatible endpoints
  • Embedding servers: infinity, text-embeddings-inference

How it works

When you call:

curl https://api.ecohash.com/v1/chat/completions \
-H "Authorization: Bearer eco_YOUR_KEY" \
-d '{ "model": "my-llama:142", "messages": [...] }'

EcoLink:

  1. Parses my-llama:142 → instance ID = 142
  2. Validates your API key owns that instance
  3. Picks a healthy region (routes to the best one)
  4. Forwards the request to that region's pod
  5. Strips the :142 suffix from the model field before forwarding (so the container sees model: "my-llama")
  6. Streams the response back

The container never sees the instance ID or region routing logic — it just sees a standard OpenAI request.

Advantages

  • Zero code in your app — use the OpenAI SDK, point base URL at api.ecohash.com/v1, done
  • Works in the console Playground — your instance appears as a selectable model
  • Regional failover — if one region's pod is unhealthy, EcoLink routes to another
  • Unified billing — per-request costs surface in Account → Billing just like platform models

Endpoints supported

All six OpenAI-compatible endpoints route the same way:

CategoryEndpoint
Chat / vision LLMPOST /v1/chat/completions
EmbeddingsPOST /v1/embeddings
Image generationPOST /v1/images/generations
Text-to-speechPOST /v1/audio/speech
Speech-to-textPOST /v1/audio/transcriptions
Video generationPOST /v1/video/generations

The one your instance handles is determined by its registered category (chat / embedding / image / audio / video).

Custom API path mode

Use when your container speaks a non-OpenAI protocol — a bespoke API, gRPC, an older HF interface, or anything custom.

Setup

  1. At launch time, uncheck "Container is OpenAI-compatible".
  2. Set Custom API path, e.g., /generate or /api/v1/predict.

EcoLink exposes a proxy URL instead of the unified API:

https://api.ecohash.com/inference-instances/<id>/proxy/<your_custom_path>

Calling it

curl https://api.ecohash.com/inference-instances/142/proxy/generate \
-H "Authorization: Bearer eco_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{ "prompt": "Hello", "max_tokens": 100 }'

Request body is forwarded as-is. Response is streamed back as-is. Authentication is still your EcoLink API key.

Tradeoffs

FeatureOpenAI-compatibleCustom
SDK supportEvery OpenAI SDK worksNeed raw HTTP
Playground visibilityYesNo
Unified API URLYes (/v1/...)No (/inference-instances/<id>/proxy/...)
Regional failoverYesYes (same routing logic)
BillingPer-request tokens / images / secondsFlat GPU-hour only (no per-request metering)
Setup simplicityZero — use existing server imagesNeed to know your container's custom schema

When to use custom mode

  • You have a fine-tuned non-LLM model (e.g., a custom diffusion model or retrieval pipeline)
  • Your container speaks a legacy API you don't want to change
  • You want raw HTTP passthrough for a custom use case

For new deployments, strongly prefer OpenAI-compatible — it's simpler for your users and integrates with the rest of the platform.

How to make your container OpenAI-compatible

If you're writing a custom inference server, here are the three endpoints you'd implement for LLM use:

  • POST /v1/chat/completions — standard OpenAI schema
  • GET /v1/models — return your served model (used by clients to enumerate)
  • GET /health — simple 200 OK for the pod readiness probe

Use one of the pre-built servers (vLLM, TGI, SGLang) unless you have a specific reason to roll your own. They handle streaming, batching, tokenization, error cases — lots of subtle behavior.