User Inference
User Inference lets you deploy your own model (a HuggingFace repo, a fine-tune on a shared filesystem, or any custom container) as a managed inference endpoint. You get:
- A unified API URL at
api.ecohash.com— same as platform models - Your own model ID you pass in requests:
model: "my-llama:142"where142is the instance ID - Built-in redundancy where possible — HuggingFace-backed and platform-model-based instances deploy to 2 replicas across 2 separate regions for regional failover. Shared-filesystem-backed models deploy to 1 replica in the filesystem's region (weights are pinned to one region).
- Billing per-GPU-hour with a 24h prepaid hold that renews automatically; runs until your balance hits zero
In other words: "vLLM / TGI / your own server, exposed as an OpenAI-compatible API, with managed infrastructure."
When to use User Inference vs other features
| You want… | Use |
|---|---|
| Call Llama / Gemma / FLUX / Whisper right now | Platform Models |
| Deploy your fine-tune as a managed API with the same OpenAI-compatible URL + your key | User Inference |
| A GPU dev box to run experiments and launch scripts | GPU Instance |
| Run a stateless service on N GPUs behind one URL, no auto-scaling | GPU Cluster |
End-to-end flow
- Register your model — tell EcoLink where the weights live (HuggingFace repo or a shared filesystem). See Registering models.
- Launch an inference instance — specify the model, target regions, GPU type/count, and container image (vLLM, TGI, or custom). See Launching an instance.
- Wait for
running— EcoLink pulls the image, loads the model, waits for the pod to be Ready. Usually 2–5 minutes for small models, 10–20 minutes for 30B+ models. - Call it — same API URL as platform models, but with your
model:instance_id:POST https://api.ecohash.com/v1/chat/completions
{ "model": "my-llama:142", "messages": [...] }
See Calling your endpoint for the full call-time story, including team members with the same key, playground usage, and failover behavior.
Two deployment modes
OpenAI-compatible container (default, recommended)
The container serves POST /v1/chat/completions (or whatever endpoints it exposes). EcoLink routes requests through its unified API so users call the standard api.ecohash.com/v1/chat/completions URL. Examples: vLLM, TGI, LMDeploy, SGLang, llama.cpp's OpenAI server mode.
Custom / non-OpenAI container
If your container speaks a different protocol (your own REST API, gRPC, bespoke schema), toggle the "Container is OpenAI-compatible" checkbox off at launch time and specify the custom_api_path (e.g., /generate). You get a proxy URL instead:
https://api.ecohash.com/inference-instances/<id>/proxy/<custom_api_path>
See OpenAI-compatible vs custom for the tradeoffs.
Cost and lifecycle
- No upfront commitment. The 24h hold is the prepaid unit; if you stop the instance in 1 hour you get ~23h refunded.
- Auto-renews every 24h until balance hits $0. At $0, the instance is stopped and any running pods killed. See Cost and lifecycle.