Skip to main content

Launching an inference instance

Launching a model as a managed OpenAI-compatible endpoint. The dialog is titled Deploy Inference Endpoint in the console.

Open the dialog

Console → Compute → Model Instances → Deploy Endpoint.

Step 1 — Engine

Pick how the model is served:

EngineWhen to use
vLLMYou want the platform vLLM image. Works out of the box for LLM and vision-LLM (VLM) models.
Custom ContainerYour own image from the Images registry (or any public Docker image). Required for image / audio / video models, or for any runtime that isn't vLLM (TGI, llama.cpp, Triton, custom servers).

Step 2 — Model source

Three mutually-exclusive options:

SourceWhat it points atReplicas / regions
HuggingFaceAny HF repo ID (e.g. Tongyi-MAI/Z-Image-Turbo). Pulls at first launch, cached per region.2 replicas across 2 regions (redundancy)
EcoLink RegistryA model you registered in Registry → Models — typically a folder on a shared filesystem (fine-tunes, custom weights).1 replica in the filesystem's region (weights are pinned)
EcoLink ModelsA built-in platform model (same catalog you see in the Playground).2 replicas across 2 regions

For HuggingFace: paste the repo ID. A HuggingFace Token field appears — required for gated repos (Llama, Gemma, …), optional otherwise. Your token is stored server-side; it isn't visible to anyone else on your account.

For EcoLink Registry: pick from a dropdown of the models you've registered. The Model Path shown next to it (e.g. /mnt/shared/Z-Image-Turbo-fine-tuning) is the path the container sees at start-up.

Step 3 — Container Image (Custom Container only)

Two ways to specify the image:

  • Pick from a dropdown of images you've saved in Registry → Images (convenient — same image used across many launches), or
  • Type a full Docker image URL, e.g. docker.io/vllm/vllm-omni:v0.18.0.

Step 4 — Runtime fields

FieldNotes
NameFriendly label (optional). Becomes part of the model ID (<name>:<instance_id>).
Startup CommandShell command the container runs. Examples: vllm serve Tongyi-MAI/Z-Image-Turbo --omni --port 8000 or just --tensor-parallel-size 2 if your image has an ENTRYPOINT that prepends vllm serve.
Service PortPort your container listens on (default 8000). The platform routes /v1/* and /health to this port.
OpenAI-compatibleDefault on. Uncheck if your container has a custom HTTP surface (then fill in Custom API path). See OpenAI-compatible vs custom.
Custom API pathRequired only when OpenAI-compatible is off — e.g. /generate.
GPU TypeShows live free-GPU count per region under each choice. Pick the type that has capacity where you want to run.
GPUs / replica1, 2, 4, or 8. Multi-GPU replicas use tensor parallelism — your startup command may need --tensor-parallel-size <n> (vLLM).
Max replicasUpper bound on total replicas across all target regions. Default 2.

The bottom of the dialog shows Per replica $X/hr and Total ($X/hr × N regions) — this is exactly what you'll be charged (as a 24h credit hold, auto-renewed) while the instance runs.

Step 5 — Deploy

Click Deploy.

What happens after launch

  1. The instance goes to pending.
  2. For each target region, EcoLink creates a pod with your model and container.
  3. For each pod:
    • Image is pulled (~1–3 min)
    • Model is downloaded (HF-backed, first-time-per-region) or mounted (FS-backed)
    • Container starts; platform waits for /health to report ready
  4. Regional status transitions pending → loading → running as pods become Ready.
  5. Instance-level status goes pending → degraded → running as regions come up. degraded means some regions are still loading; the endpoint is already callable (routes to healthy regions only).

Typical time to fully-running: 2–5 minutes for small HF models, 5–15 minutes for 30B–70B HF models, 30–90 seconds for FS-backed.

After it's running — get your endpoint

Go to the instance detail page. You'll see:

  • Model ID: my-llama:142 (where 142 is the instance ID)
  • Endpoint URL: https://api.ecohash.com/v1 — same for all user inference instances; differentiation is via the model ID
  • Proxy URL (if non-OpenAI-compatible): https://api.ecohash.com/inference-instances/142/proxy/<custom_path>

See Calling your endpoint for actual API usage.

Example — deploy Z-Image-Turbo from HuggingFace

Walkthrough for an image-generation model:

  1. Console → Compute → Model Instances → Deploy Endpoint.
  2. Name: Z-Image-Turbo.
  3. Engine: Custom Container.
  4. Model source: HuggingFace → Tongyi-MAI/Z-Image-Turbo. Paste your HF token only if the repo is gated.
  5. Container Image: docker.io/vllm/vllm-omni:v0.18.0 (or pick from your saved Images).
  6. Startup Command: vllm serve Tongyi-MAI/Z-Image-Turbo --omni --port 8000.
  7. GPU Type: pick one with capacity.
  8. Deploy.

Once it shows as running, test it (the instance detail page shows the endpoint URL, e.g. https://45.inference.ecohash.com):

curl https://45.inference.ecohash.com/v1/images/generations \
-H "Authorization: Bearer eco_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{
"prompt": "A futuristic city with flying cars, high resolution, cinematic lighting",
"n": 1,
"size": "512x512"
}'

From the API

curl https://api.ecohash.com/inference-instances \
-X POST \
-H "Authorization: Bearer eco_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "my-llama-70b",
"user_model_id": 5,
"gpu_type": "<gpu-type-for-region>",
"gpu_count": 2,
"container_image": "vllm/vllm-openai:latest",
"startup_command": "--tensor-parallel-size 2",
"service_port": 8000,
"openai_compatible": true
}'

Regions and replica count are derived from the registered model's source — you don't specify them in the request.

Credit hold at launch

The platform places a 24h hold sized for the full multi-region deployment: hourly_rate_per_gpu × gpu_count × total_replicas × 24h. If your balance can't cover it, launch fails with 402. The launch dialog shows the per-GPU rate and total hold for your selection.

See Cost and lifecycle for the full renewal / partial-hold / terminate-at-$0 story.

Startup gotchas

  • "Model not found" inside the container — check that the container's --model argument matches the model path EcoLink provides at MODEL_PATH env var (HF repo ID or mounted filesystem path).
  • OOM at load time — the model doesn't fit in the GPU count you picked. Options: use more GPUs with tensor parallelism, or quantize the model (fp8, int8, int4).
  • Container crashes with missing cuDNN — the container image needs cuDNN installed. Standard vLLM / TGI images come with it. Custom images sometimes miss it.

Next steps