User Inference

User Inference lets you deploy your own model (a HuggingFace repo, a fine-tune on a shared filesystem, or any custom container) as a managed inference endpoint. You get:

A unified API URL at api.ecohash.com — same as platform models
Your own model ID you pass in requests: model: "my-llama:142" where 142 is the instance ID
Built-in redundancy where possible — HuggingFace-backed and platform-model-based instances deploy to 2 replicas across 2 separate regions for regional failover. Shared-filesystem-backed models deploy to 1 replica in the filesystem's region (weights are pinned to one region).
Billing per-GPU-hour with a 24h prepaid hold that renews automatically; runs until your balance hits zero

In other words: "vLLM / TGI / your own server, exposed as an OpenAI-compatible API, with managed infrastructure."

When to use User Inference vs other features

You want…	Use
Call Llama / Gemma / FLUX / Whisper right now	Platform Models
Deploy your fine-tune as a managed API with the same OpenAI-compatible URL + your key	User Inference
A GPU dev box to run experiments and launch scripts	GPU Instance
Run a stateless service on N GPUs behind one URL, no auto-scaling	GPU Cluster

End-to-end flow

Register your model — tell EcoLink where the weights live (HuggingFace repo or a shared filesystem). See Registering models.
Launch an inference instance — specify the model, target regions, GPU type/count, and container image (vLLM, TGI, or custom). See Launching an instance.
Wait for running — EcoLink pulls the image, loads the model, waits for the pod to be Ready. Usually 2–5 minutes for small models, 10–20 minutes for 30B+ models.

Call it — same API URL as platform models, but with your model:instance_id:

POST https://api.ecohash.com/v1/chat/completions
{ "model": "my-llama:142", "messages": [...] }

See Calling your endpoint for the full call-time story, including team members with the same key, playground usage, and failover behavior.

Two deployment modes

OpenAI-compatible container (default, recommended)

The container serves POST /v1/chat/completions (or whatever endpoints it exposes). EcoLink routes requests through its unified API so users call the standard api.ecohash.com/v1/chat/completions URL. Examples: vLLM, TGI, LMDeploy, SGLang, llama.cpp's OpenAI server mode.

Custom / non-OpenAI container

If your container speaks a different protocol (your own REST API, gRPC, bespoke schema), toggle the "Container is OpenAI-compatible" checkbox off at launch time and specify the custom_api_path (e.g., /generate). You get a proxy URL instead:

https://api.ecohash.com/inference-instances/<id>/proxy/<custom_api_path>

See OpenAI-compatible vs custom for the tradeoffs.

Cost and lifecycle

No upfront commitment. The 24h hold is the prepaid unit; if you stop the instance in 1 hour you get ~23h refunded.
Auto-renews every 24h until balance hits $0. At $0, the instance is stopped and any running pods killed. See Cost and lifecycle.

When to use User Inference vs other features​

End-to-end flow​

Two deployment modes​

OpenAI-compatible container (default, recommended)​

Custom / non-OpenAI container​

Cost and lifecycle​

Next steps​