
Calling your endpoint

After your inference instance is running, call it the same way you call platform models: set the model field to <model_name>:<instance_id>.

OpenAI-compatible instances

curl

curl https://api.ecohash.com/v1/chat/completions \
  -H "Authorization: Bearer eco_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llama:142",
    "messages": [
      {"role": "user", "content": "Hello from my own model"}
    ]
  }'

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    api_key="eco_YOUR_KEY",
    base_url="https://api.ecohash.com/v1",
)

resp = client.chat.completions.create(
    model="my-llama:142",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)

TypeScript

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "eco_YOUR_KEY",
  baseURL: "https://api.ecohash.com/v1",
});

const resp = await client.chat.completions.create({
  model: "my-llama:142",
  messages: [{ role: "user", content: "Hello" }],
});

Streaming

Streaming works the same as for platform models: add "stream": true to the request body (stream=True in the Python SDK):

stream = client.chat.completions.create(
    model="my-llama:142",
    messages=[{"role": "user", "content": "Count to 10"}],
    stream=True,
)
for chunk in stream:
    if delta := chunk.choices[0].delta.content:
        print(delta, end="", flush=True)

Custom (non-OpenAI) instances

For instances launched with OpenAI-compatible unchecked:

curl https://api.ecohash.com/inference-instances/142/proxy/generate \
  -H "Authorization: Bearer eco_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "prompt": "Hello", "max_tokens": 100 }'

Replace 142 with your instance ID and generate with your custom_api_path. Request body and response are passed through verbatim.
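
The same call can be sketched in Python with only the standard library. The helper name build_proxy_request is hypothetical; the URL pattern and headers are copied from the curl example above, and the prompt / max_tokens body is only a placeholder for whatever your own server expects.

```python
import json
import urllib.request

BASE_URL = "https://api.ecohash.com"

def build_proxy_request(instance_id: int, custom_api_path: str,
                        payload: dict, api_key: str) -> urllib.request.Request:
    """Build a POST request for a custom instance's proxy endpoint (hypothetical helper)."""
    # Strip leading slashes so "generate" and "/generate" produce the same URL.
    path = custom_api_path.lstrip("/")
    url = f"{BASE_URL}/inference-instances/{instance_id}/proxy/{path}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_proxy_request(142, "generate",
                          {"prompt": "Hello", "max_tokens": 100},
                          "eco_YOUR_KEY")
print(req.full_url)

# To actually send it (the body comes back as your server produced it):
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```

Because the body is passed through verbatim, the response format is defined by your server, so parse it according to your own API rather than the OpenAI schema.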

Sharing across your team

Anyone on your account with an API key can call your instance, because instances are account-scoped, not user-scoped. Teammates use the same API URL and the same model:instance_id. Invite team members through the Users page.

Using in the playground

OpenAI-compatible instances appear in the console's Playground dropdown (Chat / Image / etc., matching the instance's category):

  • Chat playground — your instance appears alongside meta-llama/Llama-3.1-8B-Instruct etc.
  • Image playground — if it's an image model
  • Embeddings / Reranker — if it's one of those

Select your instance from the dropdown, then chat or prompt as usual. Every call is billed the same way as an API call.

Custom (non-OpenAI) instances do not appear in the Playground — use curl or your own client to test them.

How routing works

When you call /v1/chat/completions with model: "my-llama:142":

  1. EcoLink parses the :142 suffix → instance ID 142
  2. Validates your API key belongs to the account that owns instance 142
  3. Looks up instance 142's regional deployments in the instance cache — refreshed every 10s
  4. Picks the best region (round-robin across healthy pods, failing over on error)
  5. Forwards the request to <region-gateway>/<instance_id>/v1/chat/completions
  6. Streams the response back
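
The steps above can be illustrated with a small, self-contained sketch. This is not EcoLink's actual implementation: the names parse_model, Pod, and RoundRobinRouter are hypothetical, the regions "eu" and "sg" are made up for the example, and the cache refresh and API key check are left out. It only shows how a :142 suffix could be parsed and how round-robin selection with failover behaves.

```python
from dataclasses import dataclass
from itertools import count

def parse_model(model: str) -> tuple[str, int]:
    """Split "my-llama:142" into ("my-llama", 142)."""
    name, sep, suffix = model.rpartition(":")
    if not sep or not name or not suffix.isdigit():
        raise ValueError(f"expected <model_name>:<instance_id>, got {model!r}")
    return name, int(suffix)

@dataclass
class Pod:
    region: str
    healthy: bool = True

class RoundRobinRouter:
    """Rotate across healthy pods; move to the next pod when a call fails."""

    def __init__(self, pods: list[Pod]):
        self.pods = pods
        self._counter = count()

    def send(self, call):
        healthy = [p for p in self.pods if p.healthy]
        if not healthy:
            raise RuntimeError("503: no healthy pod available")
        start = next(self._counter) % len(healthy)
        last_error = None
        for offset in range(len(healthy)):
            pod = healthy[(start + offset) % len(healthy)]
            try:
                # Return the serving region too, like the x-ecolink-region header.
                return pod.region, call(pod)
            except ConnectionError as exc:
                last_error = exc  # fail over to the next healthy pod
        raise RuntimeError("503: every healthy pod failed") from last_error

name, instance_id = parse_model("my-llama:142")
router = RoundRobinRouter([Pod("mv"), Pod("eu", healthy=False), Pod("sg")])
regions = [router.send(lambda pod: "ok")[0] for _ in range(3)]
print(name, instance_id, regions)
```

In this sketch the unhealthy "eu" pod is never selected, and successive requests alternate between the remaining two regions.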

The response carries x-ecolink-region: mv (or whichever region served it) — useful for debugging latency.

Errors

HTTP    Meaning
401     API key invalid or doesn't own the instance
402     Account balance ≤ $0, or the instance was terminated for credit depletion
404     Unknown model / instance ID: check spelling, verify the instance is still running
429     Rate limited (depends on your plan)
503     No healthy pod available in any target region; retry
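
A client-side retry policy for these codes might look like the sketch below. It assumes 503 is worth retrying (as the table says) and additionally treats 429 as retryable with backoff, which is an assumption rather than documented behavior; 401, 402, and 404 are raised immediately because retrying cannot fix them. The function call_with_retry and its send callback are hypothetical names, not part of any SDK.

```python
import random
import time

# Assumption: 503 is transient per the table; retrying 429 is our own choice.
RETRYABLE = {429, 503}

def call_with_retry(send, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call send() -> (status, body); retry only on retryable statuses."""
    for attempt in range(max_attempts):
        status, body = send()
        if status < 400:
            return body
        if status not in RETRYABLE or attempt == max_attempts - 1:
            # 401 / 402 / 404 land here on the first attempt.
            raise RuntimeError(f"HTTP {status}: {body}")
        # Exponential backoff with jitter: 0.5-1 s, then 1-2 s, then 2-4 s.
        sleep(base_delay * 2 ** attempt * (1 + random.random()))

# Simulated server: one 503, then success. Sleeping is stubbed out.
responses = iter([(503, "no healthy pod"), (200, "Hello!")])
result = call_with_retry(lambda: next(responses), sleep=lambda seconds: None)
print(result)
```

In real use, send would wrap your HTTP call and return the status code and body.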

Latency

  • Running instance: latency is network + inference. Typical time to first token (TTFT) for a 7B LLM is 100–300 ms; time per output token (TPOT) depends on GPU count and model size.
  • Multi-region instance, one region unhealthy: requests automatically fail over to the healthy region; you may see a brief latency spike until the router's health check converges.
  • Single-region instance, that region unhealthy: no failover is possible. This can happen with filesystem-backed models, which are pinned to one region. Redeploy or wait for the region to recover.
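
To check these numbers for your own instance, you can time a streamed response on the client. The sketch below is one rough way to do it: measure_stream is a hypothetical helper, it treats each streamed chunk as one token (an approximation, since a chunk can carry more than one token), and the demo uses a fake clock so it runs without a network call.

```python
import time

def measure_stream(chunks, clock=time.perf_counter):
    """Return (ttft_seconds, mean_tpot_seconds, text) for an iterable of text pieces."""
    start = clock()
    first = last = None
    pieces = []
    for piece in chunks:
        now = clock()
        if first is None:
            first = now          # arrival of the first chunk
        last = now
        pieces.append(piece)
    if first is None:
        raise ValueError("stream produced no chunks")
    ttft = first - start
    # Mean gap between chunks after the first; 0.0 if only one chunk arrived.
    tpot = (last - first) / (len(pieces) - 1) if len(pieces) > 1 else 0.0
    return ttft, tpot, "".join(pieces)

# Demo with a fake clock: start at 0.0 s, chunks arrive at 0.2, 0.25, 0.3 s.
ticks = iter([0.0, 0.2, 0.25, 0.3])
ttft, tpot, text = measure_stream(["1", " 2", " 3"], clock=lambda: next(ticks))
print(f"TTFT={ttft:.3f}s TPOT={tpot:.3f}s text={text!r}")

# With the OpenAI SDK stream from the Streaming section, pass only non-empty deltas:
# measure_stream(c.choices[0].delta.content for c in stream
#                if c.choices and c.choices[0].delta.content)
```

Client-side timings include network latency, so they will be somewhat higher than the inference-only figures.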