Calling your endpoint
After your inference instance is running, call it the same way you call platform models — just use <model_name>:<instance_id> in the model field.
OpenAI-compatible instances
curl
curl https://api.ecohash.com/v1/chat/completions \
-H "Authorization: Bearer eco_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "my-llama:142",
"messages": [
{"role": "user", "content": "Hello from my own model"}
]
}'
Python (OpenAI SDK)
from openai import OpenAI
client = OpenAI(
api_key="eco_YOUR_KEY",
base_url="https://api.ecohash.com/v1",
)
resp = client.chat.completions.create(
model="my-llama:142",
messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
TypeScript
import OpenAI from "openai";
const client = new OpenAI({
apiKey: "eco_YOUR_KEY",
baseURL: "https://api.ecohash.com/v1",
});
const resp = await client.chat.completions.create({
model: "my-llama:142",
messages: [{ role: "user", content: "Hello" }],
});
Streaming
Streaming works the same way as for platform models: add "stream": true to the request body (stream=True in the Python SDK):
stream = client.chat.completions.create(
model="my-llama:142",
messages=[{"role": "user", "content": "Count to 10"}],
stream=True,
)
for chunk in stream:
    if delta := chunk.choices[0].delta.content:
        print(delta, end="", flush=True)
Custom (non-OpenAI) instances
For instances launched with OpenAI-compatible unchecked:
curl https://api.ecohash.com/inference-instances/142/proxy/generate \
-H "Authorization: Bearer eco_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{ "prompt": "Hello", "max_tokens": 100 }'
Replace 142 with your instance ID and generate with your custom_api_path. Request body and response are passed through verbatim.
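The same call can be made from Python. This is a minimal sketch using only the standard library. The helper names `proxy_url` and `call_custom_instance` are illustrative and not part of any SDK, and the `prompt` / `max_tokens` body is just the example payload above; the real schema is whatever your custom server expects.

```python
import json
import urllib.request

BASE_URL = "https://api.ecohash.com"


def proxy_url(instance_id: int, custom_api_path: str) -> str:
    """Build the proxy URL for a custom instance (hypothetical helper)."""
    path = custom_api_path.lstrip("/")  # tolerate "generate" and "/generate"
    return f"{BASE_URL}/inference-instances/{instance_id}/proxy/{path}"


def call_custom_instance(
    api_key: str, instance_id: int, custom_api_path: str, payload: dict
) -> bytes:
    """POST the payload and return the raw response body.

    The body is returned as bytes because the proxy passes the response
    through verbatim; decoding is left to the caller.
    """
    req = urllib.request.Request(
        proxy_url(instance_id, custom_api_path),
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

For example, `call_custom_instance("eco_YOUR_KEY", 142, "generate", {"prompt": "Hello", "max_tokens": 100})` sends the same request as the curl command above.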
Sharing across your team
Anyone on your account with an API key can call your instance — it's account-scoped, not user-scoped. Same API URL, same model:instance_id. Team invites work through the Users page.
Using in the playground
OpenAI-compatible instances appear in the console's Playground dropdown (Chat / Image / etc., matching the instance's category):
- Chat playground: your instance appears alongside meta-llama/Llama-3.1-8B-Instruct etc.
- Image playground: if it's an image model
- Embeddings / Reranker: if it's one of those
Select your instance from the dropdown, then chat or prompt as usual. Every call is billed the same way as an API call.
Custom (non-OpenAI) instances do not appear in the Playground — use curl or your own client to test them.
How routing works
When you call /v1/chat/completions with model: "my-llama:142":
- EcoLink parses the :142 suffix → instance ID 142
- Validates your API key belongs to the account that owns instance 142
- Looks up instance 142's regional deployments in the instance cache (refreshed every 10s)
- Picks the best region (round-robin across healthy pods, failing over on error)
- Forwards the request to <region-gateway>/<instance_id>/v1/chat/completions
- Streams the response back
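The parsing and selection steps can be illustrated with a short sketch. This is not EcoLink's implementation: `parse_model` and `RoundRobin` are hypothetical names, the sketch rotates over regions rather than individual pods, and the region code `fra` in the usage note is made up for illustration.

```python
from itertools import count


def parse_model(model: str) -> tuple[str, int]:
    """Split "my-llama:142" into ("my-llama", 142)."""
    name, sep, suffix = model.rpartition(":")
    if not sep or not name or not suffix.isdigit():
        raise ValueError(f"expected <model_name>:<instance_id>, got {model!r}")
    return name, int(suffix)


class RoundRobin:
    """Cycle through regions in order, skipping any that are not healthy."""

    def __init__(self, regions: list[str]) -> None:
        self._regions = regions
        self._counter = count()

    def pick(self, healthy: set[str]) -> str:
        # Try each region at most once per call, continuing the rotation
        # from wherever the previous call stopped.
        for _ in range(len(self._regions)):
            region = self._regions[next(self._counter) % len(self._regions)]
            if region in healthy:
                return region
        # Assumption: this case is what a caller would see as a 503.
        raise RuntimeError("no healthy region")
```

With `RoundRobin(["mv", "fra"])`, successive calls to `pick({"mv", "fra"})` alternate between the two regions, while `pick({"fra"})` always returns `fra`.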
The response carries x-ecolink-region: mv (or whichever region served it) — useful for debugging latency.
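To log the serving region next to your own latency measurements, read the header from the HTTP response. A minimal sketch follows; `served_region` is a hypothetical helper, and the case-insensitive match is a precaution because HTTP header names are not case-sensitive.

```python
from typing import Mapping, Optional


def served_region(headers: Mapping[str, str]) -> Optional[str]:
    """Return the x-ecolink-region value, or None if the header is absent."""
    for name, value in headers.items():
        if name.lower() == "x-ecolink-region":
            return value
    return None
```

With the OpenAI Python SDK, `client.chat.completions.with_raw_response.create(...)` returns an object whose `.headers` can be passed to this helper; call `.parse()` on it to get the usual completion object.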
Errors
| HTTP | Meaning |
|---|---|
| 401 | API key invalid or doesn't own the instance |
| 402 | Account balance ≤ $0 OR the instance was terminated for credit depletion |
| 404 | Unknown model / instance ID — check spelling, verify the instance is still running |
| 429 | Rate limited (depends on your plan) |
| 503 | No healthy pod available in any target region — retry |
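A client can act on this table by retrying only the transient statuses. The sketch below is one possible approach, not a platform requirement: `call_with_retry` is a hypothetical helper, treating 429 as retryable is an assumption (the table only says to retry on 503), and the backoff values are arbitrary.

```python
import time
from typing import Callable, Tuple

RETRYABLE = {429, 503}  # assumption: 503 per the table, 429 added by choice


def call_with_retry(
    send: Callable[[], Tuple[int, bytes]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Call send() until it returns a non-retryable status or attempts run out.

    send() performs one request and returns (http_status, body).
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE or attempt == max_attempts - 1:
            return status, body
        sleep(base_delay * 2 ** attempt)  # exponential backoff: 0.5s, 1s, 2s
    raise ValueError("max_attempts must be at least 1")
```

Statuses such as 401, 402, and 404 are returned immediately, since retrying will not fix a bad key, an empty balance, or a missing instance. If you use the OpenAI Python SDK, its `max_retries` client option already provides built-in retry behavior.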
Latency
- Running instance: latency is network + inference. Typical time to first token (TTFT) for a 7B LLM is 100–300 ms; time per output token (TPOT) depends on GPU count and model size.
- Multi-region instance, one region unhealthy: requests automatically fail over to the healthy region; you may see a brief latency spike until the router's health check converges.
- Single-region instance, that region unhealthy: no failover is possible. This can happen with filesystem-backed models, which are pinned to one region. Redeploy or wait for the region to recover.
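To check these numbers against your own instance, you can time a streamed response on the client side. The sketch below uses a hypothetical helper, `measure_stream`, and treats each streamed piece as roughly one token. That is an approximation, since a chunk may carry more or less than one token, and the measured values include network time.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    pieces: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], Optional[float], str]:
    """Consume text pieces and return (ttft_seconds, tpot_seconds, full_text)."""
    start = clock()
    first = last = None
    parts = []
    for piece in pieces:
        now = clock()
        if first is None:
            first = now  # arrival time of the first piece
        last = now
        parts.append(piece)
    if first is None:
        return None, None, ""  # the stream produced nothing
    ttft = first - start
    # Average gap between pieces after the first; undefined for one piece.
    tpot = (last - first) / (len(parts) - 1) if len(parts) > 1 else None
    return ttft, tpot, "".join(parts)
```

Create the stream inside the generator you pass in, so that the request itself is counted in TTFT. For example, pass a generator that calls `client.chat.completions.create(..., stream=True)` and yields each non-empty `chunk.choices[0].delta.content`.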