Launching an inference instance

Launching a model as a managed OpenAI-compatible endpoint. The dialog is titled Deploy Inference Endpoint in the console.

Open the dialog

Console → Compute → Model Instances → Deploy Endpoint.

Step 1 — Engine

Pick how the model is served:

Engine	When to use
vLLM	You want the platform vLLM image. Works out of the box for LLM and vision-LLM (VLM) models.
Custom Container	Your own image from the Images registry (or any public Docker image). Required for image / audio / video models, or for any runtime that isn't vLLM (TGI, llama.cpp, Triton, custom servers).

Step 2 — Model source

Three mutually-exclusive options:

Source	What it points at	Replicas / regions
HuggingFace	Any HF repo ID (e.g. `Tongyi-MAI/Z-Image-Turbo`). Pulls at first launch, cached per region.	2 replicas across 2 regions (redundancy)
EcoLink Registry	A model you registered in Registry → Models — typically a folder on a shared filesystem (fine-tunes, custom weights).	1 replica in the filesystem's region (weights are pinned)
EcoLink Models	A built-in platform model (same catalog you see in the Playground).	2 replicas across 2 regions

For HuggingFace: paste the repo ID. A HuggingFace Token field appears — required for gated repos (Llama, Gemma, …), optional otherwise. Your token is stored server-side; it isn't visible to anyone else on your account.

For EcoLink Registry: pick from a dropdown of the models you've registered. The Model Path shown next to it (e.g. /mnt/shared/Z-Image-Turbo-fine-tuning) is the path the container sees at start-up.

Step 3 — Container Image (Custom Container only)

Two ways to specify the image:

Pick from a dropdown of images you've saved in Registry → Images (convenient — same image used across many launches), or
Type a full Docker image URL, e.g. docker.io/vllm/vllm-omni:v0.18.0.

Step 4 — Runtime fields

Field	Notes
Name	Friendly label (optional). Becomes part of the model ID (`<name>:<instance_id>`).
Startup Command	Shell command the container runs. Examples: `vllm serve Tongyi-MAI/Z-Image-Turbo --omni --port 8000` or just `--tensor-parallel-size 2` if your image has an `ENTRYPOINT` that prepends `vllm serve`.
Service Port	Port your container listens on (default `8000`). The platform routes `/v1/*` and `/health` to this port.
OpenAI-compatible	Default on. Uncheck if your container has a custom HTTP surface (then fill in Custom API path). See OpenAI-compatible vs custom.
Custom API path	Required only when OpenAI-compatible is off — e.g. `/generate`.
GPU Type	Shows live free-GPU count per region under each choice. Pick the type that has capacity where you want to run.
GPUs / replica	`1`, `2`, `4`, or `8`. Multi-GPU replicas use tensor parallelism — your startup command may need `--tensor-parallel-size <n>` (vLLM).
Max replicas	Upper bound on total replicas across all target regions. Default 2.

The bottom of the dialog shows Per replica $X/hr and Total ($X/hr × N regions) — this is exactly what you'll be charged (as a 24h credit hold, auto-renewed) while the instance runs.

Step 5 — Deploy

Click Deploy.

What happens after launch

The instance goes to pending.
For each target region, EcoLink creates a pod with your model and container.
For each pod:
- Image is pulled (~1–3 min)
- Model is downloaded (HF-backed, first-time-per-region) or mounted (FS-backed)
- Container starts; platform waits for /health to report ready
Regional status transitions pending → loading → running as pods become Ready.
Instance-level status goes pending → degraded → running as regions come up. degraded means some regions are still loading; the endpoint is already callable (routes to healthy regions only).

Typical time to fully-running: 2–5 minutes for small HF models, 5–15 minutes for 30B–70B HF models, 30–90 seconds for FS-backed.

After it's running — get your endpoint

Go to the instance detail page. You'll see:

Model ID: my-llama:142 (where 142 is the instance ID)
Endpoint URL: https://api.ecohash.com/v1 — same for all user inference instances; differentiation is via the model ID
Proxy URL (if non-OpenAI-compatible): https://api.ecohash.com/inference-instances/142/proxy/<custom_path>

See Calling your endpoint for actual API usage.

Example — deploy Z-Image-Turbo from HuggingFace

Walkthrough for an image-generation model:

Console → Compute → Model Instances → Deploy Endpoint.
Name: Z-Image-Turbo.
Engine: Custom Container.
Model source: HuggingFace → Tongyi-MAI/Z-Image-Turbo. Paste your HF token only if the repo is gated.
Container Image: docker.io/vllm/vllm-omni:v0.18.0 (or pick from your saved Images).
Startup Command: vllm serve Tongyi-MAI/Z-Image-Turbo --omni --port 8000.
GPU Type: pick one with capacity.
Deploy.

Once it shows as running, test it (the instance detail page shows the endpoint URL, e.g. https://45.inference.ecohash.com):

curl https://45.inference.ecohash.com/v1/images/generations \
  -H "Authorization: Bearer eco_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A futuristic city with flying cars, high resolution, cinematic lighting",
    "n": 1,
    "size": "512x512"
  }'

From the API

curl https://api.ecohash.com/inference-instances \
  -X POST \
  -H "Authorization: Bearer eco_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "my-llama-70b",
    "user_model_id": 5,
    "gpu_type": "<gpu-type-for-region>",
    "gpu_count": 2,
    "container_image": "vllm/vllm-openai:latest",
    "startup_command": "--tensor-parallel-size 2",
    "service_port": 8000,
    "openai_compatible": true
  }'

Regions and replica count are derived from the registered model's source — you don't specify them in the request.

Credit hold at launch

The platform places a 24h hold sized for the full multi-region deployment: hourly_rate_per_gpu × gpu_count × total_replicas × 24h. If your balance can't cover it, launch fails with 402. The launch dialog shows the per-GPU rate and total hold for your selection.

See Cost and lifecycle for the full renewal / partial-hold / terminate-at-$0 story.

Startup gotchas

"Model not found" inside the container — check that the container's --model argument matches the model path EcoLink provides at MODEL_PATH env var (HF repo ID or mounted filesystem path).
OOM at load time — the model doesn't fit in the GPU count you picked. Options: use more GPUs with tensor parallelism, or quantize the model (fp8, int8, int4).
Container crashes with missing cuDNN — the container image needs cuDNN installed. Standard vLLM / TGI images come with it. Custom images sometimes miss it.

Open the dialog​

Step 1 — Engine​

Step 2 — Model source​

Step 3 — Container Image (Custom Container only)​

Step 4 — Runtime fields​

Step 5 — Deploy​

What happens after launch​

After it's running — get your endpoint​

Example — deploy Z-Image-Turbo from HuggingFace​

From the API​

Credit hold at launch​

Startup gotchas​

Next steps​