Launching an inference instance
Launching a model as a managed OpenAI-compatible endpoint. The dialog is titled Deploy Inference Endpoint in the console.
Open the dialog
Console → Compute → Model Instances → Deploy Endpoint.
Step 1 — Engine
Pick how the model is served:
| Engine | When to use |
|---|---|
| vLLM | You want the platform vLLM image. Works out of the box for LLM and vision-LLM (VLM) models. |
| Custom Container | Your own image from the Images registry (or any public Docker image). Required for image / audio / video models, or for any runtime that isn't vLLM (TGI, llama.cpp, Triton, custom servers). |
Step 2 — Model source
Three mutually-exclusive options:
| Source | What it points at | Replicas / regions |
|---|---|---|
| HuggingFace | Any HF repo ID (e.g. Tongyi-MAI/Z-Image-Turbo). Pulls at first launch, cached per region. | 2 replicas across 2 regions (redundancy) |
| EcoLink Registry | A model you registered in Registry → Models — typically a folder on a shared filesystem (fine-tunes, custom weights). | 1 replica in the filesystem's region (weights are pinned) |
| EcoLink Models | A built-in platform model (same catalog you see in the Playground). | 2 replicas across 2 regions |
For HuggingFace: paste the repo ID. A HuggingFace Token field appears — required for gated repos (Llama, Gemma, …), optional otherwise. Your token is stored server-side; it isn't visible to anyone else on your account.
For EcoLink Registry: pick from a dropdown of the models you've registered. The Model Path shown next to it (e.g. /mnt/shared/Z-Image-Turbo-fine-tuning) is the path the container sees at start-up.
Step 3 — Container Image (Custom Container only)
Two ways to specify the image:
- Pick from a dropdown of images you've saved in Registry → Images (convenient — same image used across many launches), or
- Type a full Docker image URL, e.g.
docker.io/vllm/vllm-omni:v0.18.0.
Step 4 — Runtime fields
| Field | Notes |
|---|---|
| Name | Friendly label (optional). Becomes part of the model ID (<name>:<instance_id>). |
| Startup Command | Shell command the container runs. Examples: vllm serve Tongyi-MAI/Z-Image-Turbo --omni --port 8000 or just --tensor-parallel-size 2 if your image has an ENTRYPOINT that prepends vllm serve. |
| Service Port | Port your container listens on (default 8000). The platform routes /v1/* and /health to this port. |
| OpenAI-compatible | Default on. Uncheck if your container has a custom HTTP surface (then fill in Custom API path). See OpenAI-compatible vs custom. |
| Custom API path | Required only when OpenAI-compatible is off — e.g. /generate. |
| GPU Type | Shows live free-GPU count per region under each choice. Pick the type that has capacity where you want to run. |
| GPUs / replica | 1, 2, 4, or 8. Multi-GPU replicas use tensor parallelism — your startup command may need --tensor-parallel-size <n> (vLLM). |
| Max replicas | Upper bound on total replicas across all target regions. Default 2. |
The bottom of the dialog shows Per replica $X/hr and Total ($X/hr × N regions) — this is exactly what you'll be charged (as a 24h credit hold, auto-renewed) while the instance runs.
Step 5 — Deploy
Click Deploy.
What happens after launch
- The instance goes to
pending. - For each target region, EcoLink creates a pod with your model and container.
- For each pod:
- Image is pulled (~1–3 min)
- Model is downloaded (HF-backed, first-time-per-region) or mounted (FS-backed)
- Container starts; platform waits for
/healthto report ready
- Regional status transitions
pending → loading → runningas pods become Ready. - Instance-level status goes
pending → degraded → runningas regions come up.degradedmeans some regions are still loading; the endpoint is already callable (routes to healthy regions only).
Typical time to fully-running: 2–5 minutes for small HF models, 5–15 minutes for 30B–70B HF models, 30–90 seconds for FS-backed.
After it's running — get your endpoint
Go to the instance detail page. You'll see:
- Model ID:
my-llama:142(where142is the instance ID) - Endpoint URL:
https://api.ecohash.com/v1— same for all user inference instances; differentiation is via the model ID - Proxy URL (if non-OpenAI-compatible):
https://api.ecohash.com/inference-instances/142/proxy/<custom_path>
See Calling your endpoint for actual API usage.
Example — deploy Z-Image-Turbo from HuggingFace
Walkthrough for an image-generation model:
- Console → Compute → Model Instances → Deploy Endpoint.
- Name:
Z-Image-Turbo. - Engine: Custom Container.
- Model source: HuggingFace →
Tongyi-MAI/Z-Image-Turbo. Paste your HF token only if the repo is gated. - Container Image:
docker.io/vllm/vllm-omni:v0.18.0(or pick from your saved Images). - Startup Command:
vllm serve Tongyi-MAI/Z-Image-Turbo --omni --port 8000. - GPU Type: pick one with capacity.
- Deploy.
Once it shows as running, test it (the instance detail page shows the endpoint URL, e.g. https://45.inference.ecohash.com):
curl https://45.inference.ecohash.com/v1/images/generations \
-H "Authorization: Bearer eco_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{
"prompt": "A futuristic city with flying cars, high resolution, cinematic lighting",
"n": 1,
"size": "512x512"
}'
From the API
curl https://api.ecohash.com/inference-instances \
-X POST \
-H "Authorization: Bearer eco_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "my-llama-70b",
"user_model_id": 5,
"gpu_type": "<gpu-type-for-region>",
"gpu_count": 2,
"container_image": "vllm/vllm-openai:latest",
"startup_command": "--tensor-parallel-size 2",
"service_port": 8000,
"openai_compatible": true
}'
Regions and replica count are derived from the registered model's source — you don't specify them in the request.
Credit hold at launch
The platform places a 24h hold sized for the full multi-region deployment: hourly_rate_per_gpu × gpu_count × total_replicas × 24h. If your balance can't cover it, launch fails with 402. The launch dialog shows the per-GPU rate and total hold for your selection.
See Cost and lifecycle for the full renewal / partial-hold / terminate-at-$0 story.
Startup gotchas
- "Model not found" inside the container — check that the container's
--modelargument matches the model path EcoLink provides atMODEL_PATHenv var (HF repo ID or mounted filesystem path). - OOM at load time — the model doesn't fit in the GPU count you picked. Options: use more GPUs with tensor parallelism, or quantize the model (fp8, int8, int4).
- Container crashes with missing cuDNN — the container image needs cuDNN installed. Standard vLLM / TGI images come with it. Custom images sometimes miss it.